KeyLimeTie Blog
If you haven't yet seen the Table Connect for iPhone, this is really cool. This caught our eye, especially as we've already reported on how user experiences are becoming fragmented across "contexts." As surface computing emerges (see the Microsoft Surface and the Touch Taste interactive table) the things people are doing now on their computers, smartphones and tablets will truly make their way to the desktop. Check out the video here:

How about an iOS version of the piano from Big?
Some people online report that the table is fake, but the developers at tableconnect.co.cc are spending their time developing additional prototypes instead of debating online critics with too much time on their hands.
Of course, all we can think of is the piano in Big on steroids. Just imagine what would be possible if Apple produced a high-resolution iOS that allowed for multitasking and multiple users on a surface. While researchers are building multi-user, multitouch platforms in the lab, an iOS with these capabilities would bring the benefits of the vibrant App Store and developer community, populating a platform like this with useful apps rather quickly.
Need an app for your company?
If you're interested in building your own app, please let us know. We're happy to discuss how you can use an iPhone, iPad, Android or Windows Phone 7 app to help boost your brand, increase productivity or shave cost. Give us a call at 630.598.9000, and make sure to follow KeyLimeTie on Twitter.
Continuing our recent theme about QR codes, here's a code we found at the end of a TV commercial on Fox. The QR code leads to the Videos tab of a mobile microsite promoting the new Fox TV series Lone Star. There are also tabs for a Homepage, Photos and Cast. No other calls to action exist, other than to check out content related to the show.
The use of QR codes on television allows advertisers to take a difficult-to-measure mass medium and gain a direct response from viewers, closing the loop on their marketing efforts. Further, using a QR code on a TV commercial effectively extends the length of the advertisement for the people who follow the link, bringing even more value to the advertiser.
Combine QR with a Strong Landing Page
Combine this with a carefully constructed mobile landing page and well written copy and you can capture the interest of easily distracted mobile users. Here are a few things Fox's mobile website does to capture that interest effectively:
- It tells you how long the videos are. This helps because mobile users know they don't have much time. Seeing video times of 75 to 90 seconds increases the likelihood of someone watching your clip.
- The site itself contains few pages and gets straight to the point. The videos give teasers of the show, Photos capture interest through action shots, and the Cast page shows the actors playing in the show.
- The home page mashes up all three other tabs into one page you can scroll through. If this is the only page you visit, you will still get a cross section of all the content the site has to offer.
Strangely, the site is missing social media links. We can only suppose that this was intentional, perhaps to keep people from clicking away from the mobile site and losing the opportunity to engage with the content they're currently consuming.
We've just begun to see the practical uses for QR codes. Whether on broadcast TV, a web video or another medium, here are a few other applications you can use:
- Mobile App download page (iTunes App Store or Android Marketplace)
- Mobile landing page for your business
- Email List or SMS Alert opt-in form
- A local business listing, such as a restaurant page with reviews, hours, reservation link and address
Have a use for mobile landing pages?
If this article sparks your interest and you see a need for mobile landing pages and QR codes for your company, give us a call at 630.598.9000. And don't forget to follow us on Twitter and like us on Facebook.
Afraid of screwing up with your customers online? If the technical side of the web wasn't daunting enough, it gets more so when you add people. We've all seen and read the stories of communications gaffes that have resulted in an online firestorm for the company. Some mistakes are innocent, while others are arguably a result of poor customer service, PR people not getting it, or even dishonesty that finds the company out. Here are a few examples:
So what about when it happens to you? What if a customer or constituent gets upset by something you do and they decide to take it to the web?
Have you had an experience like this? Maybe yours wasn't on as broad a scale, but if you've had something like this happen, please take a minute to share below in the comments what happened and what you did to make it right.
Of all of the bold statements made surrounding the iPad, one of the most notable came from Shervin Pishevar at SXSW Interactive. "The laptop is the rotary phone of our generation," he quipped. This quote has been resonating with me over the last couple days as I've been thinking through what tablet computers and tablet apps mean for how we will use computers in the short months and years to come.
On the eve of securing my own iPad, I've been thinking through the types of apps I'll want to load on it for both fun and productivity. Presumably, I'll carry the iPad around more than my three-year-old MacBook Pro, even though that machine has my most used applications (including Microsoft Office, iWork and Adobe Creative Suite in addition to handy programs like TextWrangler and Omni Outliner). However, the iPad will fill a fundamentally different space. I imagine I'll still use the laptop for heavy-duty work like long-form writing and building presentations, while I'll mostly use the iPad for things like email, note-taking, social networking, and perhaps video editing.
OmniGraphSketcher for the iPad
But this is a fundamental shift, so my expectations now could end up being very wrong. Given the sophistication of some of the apps I've seen (for example, the Omni Group's productivity apps), the iPad may well become a primary productivity tool when I'm on the go. In fact, with the proven ingenuity of the iPhone developer community throwing all of its collective creativity at a larger screen, it most likely will.
In essence, the app paradigm scaled up from a phone to a tablet takes the "computer" out of the picture. Purpose-built touch-based apps for iPhone OS, Android and other emerging platforms make both fun and productivity more accessible to the mass market of people who don't consider themselves particularly good with computers. So much so that popular tech pundit David Pogue wrote a comprehensive iPad review from the perspective of both "techies" and non-techies. Pogue says:
"The iPad is so fast and light, the multitouch screen so bright and responsive, the software so easy to navigate, that it really does qualify as a new category of gadget. Some have suggested that it might make a good goof-proof computer for technophobes, the aged and the young; they’re absolutely right."
—David Pogue
The new touch interface, app paradigm, and resulting end-to-end user experience are creating a new generation of computer users who won't even realize that's what they are. Some day soon we may see the word "computer" dissociated from portable devices; people will still refer to desktops as computers, but an iPad? "Oh, that's my tablet." The laptop may eventually fade into memory.
So now my question is this: two years from now, when I had planned to replace my laptop, will I even want one?
The release of the much-anticipated iPad is fast approaching. Do you know it will explode the possibilities for ways you can interact with your customers and audience?
Now that we as an industry have seen the iPad and developers have had the opportunity to write applications for the new device, thoughts are crystallizing around just how many new ways content creators, publishers, brands, companies and organizations will have to reach, engage and serve their respective audiences.
Discussion at SXSW
On Saturday, March 13 at the South by Southwest (SXSW) Interactive conference, developers, gaming and media executives gathered on a panel to discuss these very opportunities. The thoughts that emerge will enlighten you to the scope of the opportunity ahead. The panel, entitled "iPad: New Opportunities for Content Creators," validated and enhanced many thoughts KeyLimeTie has been having about the device's potential.
Moderator Raven Zachary (@ravenme) set the stage by telling the audience that the pre-launch demand for the iPad is higher than it was for the original iPhone. On the first day of pre-order sales, Apple sold 51,000 iPads within the first two hours, and 90,000 units within six hours (source).
Why such a high demand? Today approximately 75 million people use the iPhone OS (between iPhone and iPod Touch owners) and are familiar with the multi-touch screen interface as well as the App Store. Many of these people will enthusiastically purchase iPads and in the process bypass the learning curve because they're already familiar with how to operate it.
The Panel Discussion
The panelists each gave a perspective on the iPad based on their respective industries. Bill Jensen (@BillyJensen), Director of New Media for The Village Voice, talked about the power of the iPad to deliver well-formatted niche content. The Village Voice is one of the few print weeklies that continues to see growth thanks to its local focus, and Jensen seemed keen to leverage this digital format, with its lower cost barrier, to deliver more niche content.
Jensen made an illustration of the variety of content available through the largest digital medium—the web—and through print distribution channels. A typical city street may have 10 newspaper boxes and the largest of bookstores could carry up to 1,000 magazines while the Internet boasts an almost unfathomable 109.5 million web sites. Being a digital medium, the iPad will bring back the experience of reading elegantly typeset books fused with interactive media, while offering a selection that will dwarf bookstores.
76% of Top 5 grossing apps in the iTunes Store are games. Panelist Shervin Pishevar (@Shervin) from the Social Gaming Network (creators of best-selling iPhone games) asserted that by 2013 the app market will have an estimated value of $30 billion, with approximately 20 million iPads sold. Pishevar discussed the ways the iPad's size and features will change the way people both experience and produce content.
According to Pishevar, the iPad's unique value lies in four distinct factors:
- Screen real estate
- Processing power
- The immersive experience it affords
- Convenient size
"The iPad enables new usage occasions, pushes creative frontier and boosts engagement," says Pishevar. "Greater engagement leads to higher ARPU," or average revenue per user. "The iPhone is more for media consumption, where the iPad will be for media creation," said panelist Jason Grigsby (@grigs). Pishevar even speculated about a radical shift in human-computer interactions, musing that we may now see real-time collaboration between two people using the same device simultaneously.

Photo credit: Wired Magazine.
Katherine Tasheff (@tasheffka) of Hyperion Books said, "The iPad mimics the experience of reading a book like nothing else does." Hyperion, she says, is seeing print book sales decline thanks to e-readers. "[The iPad and e-readers are] the first step toward the virtually paperless society we will be in about twenty years," added Jensen. Underscoring the iPad's potential for ubiquity, Tasheff added, "This is the first device both my father and I are excited about. And I am tech support for the man, I know what's involved."
Other Observations
- The potential for two people to collaborate or play a game on the same device instead of two networked devices.
- The iPad will be used fundamentally differently than the iPhone. The panelists are questioning the need for things like camera and GPS because people will treat the device more like a computer than a phone.
- The iPad and similar devices will likely signal the end of vertical scrolling (like on computers), as this is an artifact of non-touch-screen interfaces. Now that we are touching the screen, we won't need to scroll. Instead, designers and content creators will be free to build content that pages more naturally.
Looking Ahead
We've only seen the beginning of the possibilities on the iPad. As the iPad launches, think through ways you can better serve your customers by producing an app or re-formatting the content you create for the device. Its popularity virtually ensures someone who does what you do will be competing for people's attention on this new screen. This is an opportunity, and also a call to action, to lead the way for your industry in providing quality interactions on the digital device where many of your customers, users, fans, and audiences will be conducting their day-to-day business, communication, and pleasure activities.
If you have questions about how your company can build an iPad application, or if you are looking for a technology partner with whom to explore this frontier, please call KeyLimeTie at 630.598.9000.
Are budgets tight at your company? In many places they are. Whether you're a Fortune company or a bootstrapped entrepreneur, chances are you have a list of needs and are holding off on purchasing something that would help you grow your business.
ScaleWell, a new quarterly grant given by entrepreneurs for entrepreneurs, aims to change the way companies look at what it takes to gain traction for one's business. By giving away $1000 per quarter (no strings attached) to one company, ScaleWell is encouraging companies to look at ways to grow by funding their experience. According to the ScaleWell web site, it's a way to enable the recipient to answer questions like "How many customers [can] I acquire for $1000?" or "How much closer to profitability can I get by investing this small amount?"
Moreover, the buzz ScaleWell is generating in the Chicago business community, the region KeyLimeTie calls home, can serve to inspire you no matter your role--whether you are a solo entrepreneur or an executive at a large company. Take a fresh look at how you can grow or improve by investing a small, finite amount of money to fund an action or make a purchase that will gain you traction. Once you're done, measure the results. Whether this particular experiment succeeds or fails, you're learning along the way and taking positive steps forward.
ScaleWell was founded by Andy Angelos, Ziad Hussain, and Sean Corbett--all entrepreneurs looking to scale their own businesses while helping others do the same. ScaleWell is funded by Trustees, who each donate $100 and volunteer to advise the grant recipient. Trustees decide on the award recipient from applications received each quarter.
KeyLimeTie came to support ScaleWell through CIO Peter Morano. Internally, Pete has been instrumental in applying similar ideas through the KeyLimeTie Labs innovation group, and saw this as an opportunity to lend support to another business in the larger community.
How could you scale your business by investing $1000? You don't have to be a ScaleWell recipient to do this. The budget constraint makes it realistic to think and act this way. With $1000, here are some of the things you could do to scale:
- Buy a video camera to shoot videos and share them online to attract more customers.
- Purchase costly software or hardware that would enable you to do more, or increase efficiency.
- Invest in a graphic design for your company or product that improves your presentation and allows you to sell more.
- Build enhancements to your web site, or purchase hosting for a new web site for a year.
- Hire a C-Level consultant to work with you on strategic alignment within your company or group.
- Sponsor an event that will gain you exposure and put you on the map in a market or a community.
How will you scale your business well? KeyLimeTie wants to know. Leave a comment below!
Update: The first ScaleWell grant was given to Michael Una for his business, Unatronics, which sells handmade electronic musical instruments. He will use the grant money to develop additional products he can sell.
Wednesday's much anticipated iPad tablet device appears to be, at first glance, a scaled-up version of the iPhone. By releasing the iPad, Apple is carving out a new category of device, and a new way people will interact with computers. Over the past ten years, there have been many unsuccessful attempts at building a widely-adopted tablet PC, so of course there is skepticism.
It would be easy to dismiss this device as nothing special, before considering how the App Store made the iPhone and the iPod Touch into the outstandingly popular devices they are today. At this point, we've just seen what Apple (and a select group from the developer community) have done with the iPad. The real applications are yet to come, thanks to the limitless creativity of the iPhone—and now iPad—developer community, including companies like KeyLimeTie.
Further, industry reporters like TechCrunch's MG Siegler explain why the iPad will succeed: its target audience is the 75 million iPhone and iPod Touch users. These people will know how to use the iPad right out of the gate.
Those people will also grow more and more accustomed to a web you can touch. Full web pages are now practical on the iPad screen, and people are used to pinching and swiping their way around your site as they make purchases, download documents, play games, and write comments. The web as we know it will evolve, and interaction design will shift, as the iPad and other tablets capture our share of screen time.
Many people will opt to leave the laptop at home and use the iPad for communications, eBook reading, entertainment, and even productivity when larger screens and computing power aren't required. But, imagine for a moment a restaurant menu displayed on an iPad, or iPads being used to process transactions in a retail store. That, of course, is just the start.
Here are some others' thoughts on the iPad's viability:
Like what we have to say? Follow @KeyLimeTie on Twitter or join our Facebook fan page for continued updates.
This past weekend I attended a lunch presentation by Andrew Mason of Groupon, which he gave to participants in the Chicago Urban League's NextONE Program. I was invited by my great friend John (JR) Dallas and welcomed by the Urban League staff.
Groupon, if you are not familiar, is widely regarded as Chicago's biggest tech success story of 2009. The web site allows people to buy one steeply discounted offer each day, provided enough other people also buy the offer so the retailer has a critical mass of new customers. They achieved profitability in the spring of 2009 and closed on $25mm of funding late last year, when they admittedly didn't need the money. Now they're on target for $100mm in revenues in 2010.
It would be too easy to romanticize the above. Mason and team came up with an idea, coded it in a month, and it became a runaway success. But just looking at their history as Groupon would be denying some important lessons about innovation and persistence.

ThePoint.com: Groupon's predecessor.
Before Groupon, there was (and still is) a web site called The Point, which Mason started in 2006. The Point was built to allow people to achieve critical mass on a political or social issue before taking action, to ensure the action they took (a donation, a protest, a mass action, etc.) had an impact. The site itself didn't take off as the founders expected because of a lack of focus; they were providing a platform for an undefined audience to take action on any potential issue.
The software and concept that powered The Point now powers Groupon. In late 2008, the team worked for a month to get the product off the ground, with very limited features and simple e-commerce capabilities, and the new, focused idea stuck.
Among many others, I pulled these lessons from Andrew's talk and from knowing the Groupon story:
- They weren't afraid to act, try something different, and risk failure. Groupon was a 30-day diversion from working on The Point. If it failed, they wouldn't be out a lot of time, money, or emotional investment.
- They took an existing asset, the software engine powering The Point, and applied it in a different way. They learned that this new application had considerably more monetary value than the original.
- Mason and the team continually improve Groupon by creating a product they themselves want to use, adding features and improvements based upon problems they themselves have. Their philosophy: "If I have this problem, chances are someone else does, too."
Take a look inside your business as we take a look inside ours. Do you have the opportunity to "Pull a Groupon?" Perhaps you have software systems that are built for one purpose that you could refactor for a different one, or maybe you could deliver your services to a completely new audience. Chances are you are creating a product or service right now that could either make better use of by-products created or could be applied in a completely different way.
If this article strikes a chord with you, please let us know in the comments. If you see successes from "pulling a Groupon," please let us know (and Groupon too, I'm sure they'd appreciate it)! Finally, if there is an opportunity for KeyLimeTie to assist developing the software needed for you to accomplish your goals, please drop us a line.
In the aftermath of the Haiti earthquake last week, amidst the rescue effort headlines, is a robust discussion in the digital marketing industry around the power of text message campaigns to quickly mobilize people while creating an audience. One week after the earthquake, the American Red Cross's text message (SMS) campaign alone has raised over $24 million for the relief effort.
The simple campaign asks people to text the word "Haiti" to short code 90999. Once the user answers the confirmation message, a $10 charge is added to their mobile bill that month. After confirming, the user is prompted again, this time with a question about whether they would like to receive Red Cross alerts straight to their mobile phone.
That's right. The Red Cross raised $24 million in one week from 2.4 million individual $10 donations by people with mobile phones. Why did it work? The message got out when the disaster was getting the most coverage and offered donors instant gratification in donating via an unprecedentedly simple method.
The campaign itself is viral because it's short, timely, memorable, and actionable. You can easily tell someone "Text 'Haiti' to 90999 to donate $10 to the relief effort" in a text message, tweet, status update, a phone call, or an email. Using text and social networking technology, the message has potential to spread exponentially, and this one did. As a result, expect to see more charities and relief agencies using SMS for fundraisers when time is of the essence.
The larger lesson for businesses amidst the tragedy that prompted the campaign is great. Text messaging campaigns have taken the "impulse buy" and freed it from the four walls of a retail store. Now people can respond to an ad campaign, make a quick purchase, or make a quick donation right where they are, with the same convenience of chatting with a friend.
Further, you can build opt-in lists and notify people of promotions, sales, or send news alerts that will reach them instantly in the future. Many short code providers have CRM systems so you can manage customer relationships and even integrate their text profiles with their online profiles in your main e-commerce or CRM system.
While SMS short codes have been around for years, the tragedy in Haiti is being marked by the industry as an event that has now proven the critical mass--and the effectiveness--of SMS response campaigns. If you are curious about ways your business can utilize short codes and integrate them with the rest of your digital strategy, talk to KeyLimeTie. Or, if you'd like to read up on short codes, see this informative article on GigaOM.
Last week I found this decal on a store front while in San Francisco. After searching the web to learn about the program, I learned Google is focusing more and more on local business and location-based search as a new revenue stream, and improving how companies advertise their businesses.
Google launched a pilot program where they sent out 100,000 of these window decals to the most popular local businesses listed on their web site. The stickers contain a QR code (short for "Quick Response") so passersby can snap a quick photo of the code and visit the Google Local listing for that company. There they can find business information and aggregated reviews.
This helps people learn more about the businesses they walk by every day. They might find a copy shop or a café, and be able to see what others think about the place before they buy. Or, they could save information about a location for later, or share with a friend, by sharing the local search link that comes up in their phone's browser.

Google Local result for QR code.
What does this mean for small businesses? It means people will be looking up your company more and more on their phones. Here are two excellent ways to ensure they get the best information they can about you:
- Sign up for and update your Google Local Business Center listings to add custom information to your local search listings, including local coupons. Use this also to analyze who is searching for your business and where they are located, to aid in your marketing efforts.
- Make sure your web site is mobile-optimized. The best way to do this is to have your web development firm build a mobile stylesheet for your web site. With a mobile stylesheet, people visiting your site via their phone's browser will see all of the text and images optimized for the small browser. Mobile web sites are specifically designed to present relevant, location- and time-sensitive information to people seeking you via their phones.
If you would like KeyLimeTie to optimize your web presence for mobile, or if you have questions about Google's Local Business Center, give me a call or reach out to @KeyLimeTie on Twitter. We'll be happy to help.
The last song on the radio, as I was parking my car today, was Turning Japanese (bonus points if you can name the artist without using Google). This is not the ideal song to have running through your head on a workday, or pretty much any other time. So in an effort to dispel the demons of eighties novelty pop music from my tortured brain, I thought today was as good a time as any to write about the Google Web Toolkit (GWT).
GWT is a framework for writing interactive AJAX web applications. Take a look at the GWT home page for lots of great information about GWT. I'm going to look at this from the point of view of answering the key question: "Why GWT?" In other words, out of all the AJAX frameworks out there, what compelling reasons are there to choose GWT?
Write Java code, not JavaScript
One of the most distinctive features of GWT is that you code in Java. While in development, you can run your Java code in a special "hosted mode" browser with all of the mature, sophisticated Java debugging tools you always have available. For deployment, this Java code is run through a compiler that translates it to cross-browser JavaScript.
Although this is significant, especially in shops with a lot of Java experience, I don't believe this is a truly killer feature of GWT. The important thing is that the developer is shielded from the complexities of writing JavaScript code that works in all browsers. Other AJAX frameworks provide a layer of JavaScript that provides this abstraction. GWT does this by translating standard Java code, but how this is abstracted is less important than the simple fact that it is abstracted.
Simple RPC mechanism
GWT has its own mechanism for remote procedure calls between the browser and the server. This works a lot like vanilla Java RMI - define interfaces and implementations for your server functions, and code will be generated to allow your client code to call them as if they were simple local methods. This blows away all the work of defining XML or JSON data formats for requests and responses. Just code the function for the server, call the function (still in Java) from the client code, and all of the marshalling, unmarshalling, network communication, etc. is done for you.
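To make the shape of this concrete, here is a minimal sketch in plain Java. It is not real GWT code: the `AsyncCallback` interface below is a local stand-in modeled on GWT's callback type so the example compiles without the GWT jar, and `GreetingService`, `GreetingServiceImpl` and `LocalGreetingProxy` are hypothetical names made up for illustration. In a real project the client-side proxy would be generated for you by GWT rather than written by hand, and the callback would fire asynchronously once the server responds.

```java
import java.util.ArrayList;
import java.util.List;

// Client code: calls the service through the async interface, as if it were local.
class RpcSketch {
    static String callGreet(String name) {
        final List<String> out = new ArrayList<>();
        GreetingServiceAsync service = new LocalGreetingProxy();
        service.greet(name, new AsyncCallback<String>() {
            public void onSuccess(String result) { out.add(result); }
            public void onFailure(Throwable caught) { out.add("error: " + caught.getMessage()); }
        });
        // The stand-in proxy calls back immediately, so the result is already here.
        return out.get(0);
    }

    public static void main(String[] args) {
        System.out.println(callGreet("GWT"));
    }
}

// Local stand-in for the callback type (assumption: defined here only for the sketch).
interface AsyncCallback<T> {
    void onSuccess(T result);
    void onFailure(Throwable caught);
}

// The server function, declared as an ordinary Java interface (hypothetical).
interface GreetingService {
    String greet(String name);
}

// The client-side counterpart: same method, no return value, plus a callback.
interface GreetingServiceAsync {
    void greet(String name, AsyncCallback<String> callback);
}

// The server-side implementation of the function.
class GreetingServiceImpl implements GreetingService {
    public String greet(String name) {
        return "Hello, " + name + "!";
    }
}

// Hand-written stand-in for the generated proxy. The real thing would marshal
// the call, send it over HTTP, and unmarshal the response.
class LocalGreetingProxy implements GreetingServiceAsync {
    private final GreetingService target = new GreetingServiceImpl();

    public void greet(String name, AsyncCallback<String> callback) {
        try {
            callback.onSuccess(target.greet(name));
        } catch (RuntimeException e) {
            callback.onFailure(e);
        }
    }
}
```

Notice that nothing in the client or server code mentions XML, JSON, or HTTP. That is the point: the only contract between the two sides is a pair of Java interfaces.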
GWT also supports other RPC methods, such as XML or JSON over HTTP. You can write server code as standard SOAP or REST web services and GWT will be able to use them directly. This will take longer to code than the GWT RPC mechanism, but allows you to use legacy web services or services intended to also be used by non-GWT clients. A good middle-ground approach may be to prototype using GWT RPC, and re-implement as web services later.
Optimized cross-browser JavaScript
One of the key reasons to use any AJAX framework is to write code that will produce identical results on all browsers, which is fiendishly difficult in raw JavaScript. Why roll your own code to handle all of those quirks, when a bunch of framework authors have already done it for you? GWT handles this, just like any good framework. As a bonus, the generated JavaScript is automatically compressed and obfuscated, decreasing download size and making reverse engineering more difficult.
Download sizes are kept even smaller by a unique GWT feature, namely...
Deferred binding
Writing code that works on any browser is great. But there is a cost - the framework has to include code for every browser. You may be viewing the page in IE, but you also downloaded code specific to Firefox, Safari, and Opera, which will never be executed. Deferred binding is the GWT solution to this dilemma.
The first step in deferred binding happens during compilation. The GWT compiler creates a different JavaScript file for each browser type, containing all of your application code optimized for just that one browser. A very small bootstrap script is also generated, which examines the execution environment and determines which browser-specific application script to load. The bootstrap script is included in a standard HTML page, and when the page loads the script will resolve, download, and run the correct application script.
There is also provision for taking advantage of caching. The bootstrap file is intended to never be cached (and is helpfully suffixed with "nocache.js" as a reminder). The implementation scripts are intended to be cached, and have file names that are a hash code of their contents. This means that if you re-compile, but there are no changes in a particular file, it will have the same name and will allow the browser to use the already cached version. If the file changes, its filename will change and therefore force a reload when it is next requested. All of this is done without any developer help (the generated bootstrap script deals with the filenames automatically). Pretty neat, huh?
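The content-hash naming trick is easy to try out in plain Java. The sketch below is an illustration of the idea only, not the GWT compiler's actual code: the choice of MD5, the `.cache.js` suffix, and the `cacheFileName` helper are all assumptions I picked for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class CacheNameSketch {
    // Derives a file name from the script's contents, so the name changes
    // if and only if the contents change (hash collisions aside).
    static String cacheFileName(String scriptContents) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(scriptContents.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02X", b)); // two hex digits per byte
            }
            return hex + ".cache.js"; // suffix is an assumption for this sketch
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String first = cacheFileName("alert('v1');");
        String recompiledUnchanged = cacheFileName("alert('v1');");
        String recompiledChanged = cacheFileName("alert('v2');");

        // Same contents, same name: the browser can keep using its cached copy.
        System.out.println(first.equals(recompiledUnchanged));
        // Different contents, different name: the browser has to fetch the new file.
        System.out.println(first.equals(recompiledChanged));
    }
}
```

Because the name is derived from the contents, the implementation scripts can be cached for as long as you like. Only the tiny bootstrap file needs to stay fresh, since it is the one that knows the current names.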
I've only talked about deferring binding based on browser type so far, but this is a general mechanism and you can define other context-dependent variances as well. The most common use of this is...
Localization
Localizing a GWT application is very straightforward and very similar to how it is done in standard Java applications. In the simple case, just create properties files with locale suffixes, access them by key through a GWT interface, and the locale-specific string will be used. The cool part is that GWT uses the deferred binding mechanism to make sure that only those properties for the current user's locale are ever downloaded.
As with browser-specific code, this starts at compile time. For each combination of browser AND locale, an application script is generated. So, if you have, say, 4 browsers and 3 different locales, you will have 12 files generated, such as Firefox in English, Firefox in French, IE in English, etc. The bootstrap file will examine the client environment and load the application file specific to the detected browser and locale.
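The arithmetic is just a cross product, which a few lines of plain Java can spell out. This is a sketch of the counting only: the browser and locale identifiers and the `browser_locale` naming are made up for illustration (as described above, the real generated files are named by content hash), and German is a hypothetical third locale.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PermutationSketch {
    // One application script per (browser, locale) pair.
    static List<String> permutations(List<String> browsers, List<String> locales) {
        List<String> result = new ArrayList<>();
        for (String browser : browsers) {
            for (String locale : locales) {
                result.add(browser + "_" + locale);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> browsers = Arrays.asList("ie", "firefox", "safari", "opera");
        List<String> locales = Arrays.asList("en", "fr", "de");

        List<String> scripts = permutations(browsers, locales);
        System.out.println(scripts.size() + " scripts generated");

        // What the bootstrap step boils down to: pick the one script that matches
        // the detected environment, and download nothing else.
        String detected = "firefox" + "_" + "fr";
        System.out.println("load: " + scripts.get(scripts.indexOf(detected)));
    }
}
```

The compile step pays for all twelve combinations up front so that each visitor only ever downloads one of them.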
Image bundles
It should be obvious by now that the Google team put a lot of effort into ensuring the highest possible performance. Image bundles are yet another performance-boosting feature of GWT. Defining an image bundle allows the GWT compiler to package a number of images into a single file which is accessed through a Java object. This reduces the number of network round trips by getting all images for your application in a single file download. The packaging is done by the GWT compiler; all the developer has to do is define an annotated interface with a function to access each image.
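The general idea behind bundling can be sketched in a few lines of Python: put the images side by side in one combined file and remember where each one sits, so it can be clipped out later. This is only an illustration of the concept. The image names and sizes are made up, and the simple left-to-right layout is an assumption rather than a description of how the GWT compiler arranges a bundle.

```python
from typing import Dict, List, Tuple

# (name, width, height) in pixels; the names and sizes are made up.
Image = Tuple[str, int, int]
Region = Tuple[int, int, int, int]


def layout_strip(images: List[Image]) -> Tuple[Dict[str, Region], int, int]:
    """Place images side by side and return each one's clipping rectangle.

    Returns (regions, strip_width, strip_height), where regions maps an
    image name to (left, top, width, height) inside the combined strip.
    """
    regions: Dict[str, Region] = {}
    left = 0
    strip_height = 0
    for name, width, height in images:
        regions[name] = (left, 0, width, height)
        left += width
        strip_height = max(strip_height, height)
    return regions, left, strip_height


icons = [("save", 16, 16), ("open", 16, 16), ("logo", 120, 40)]
regions, strip_width, strip_height = layout_strip(icons)
print(regions["logo"])
print(strip_width, strip_height)
```

Three images would normally cost three requests; with the layout table above, one download of the combined strip serves all of them.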
Embeds well in existing sites
Some frameworks work best only if they are in charge of the whole page. GWT "plays well with others", in that it can generate entire pages or elements within a page equally well. The GWT scripts work by creating HTML elements inside a specified element on a page during the onLoad() event. To embed a GWT application in an existing page, you therefore only need to include the script file and add an id to the element you want to contain the application (which can be a div, a td, the body, anything). Couldn't be easier.
Conclusion
That's a lot of good reasons to choose GWT, and that's just scratching the surface. There are many other features that make GWT a compelling choice. These include interoperation with native JavaScript, intelligent support for back button navigation within an application, accessibility support, programming delayed logic, support for JUnit testing, availability of third-party widget libraries, and more. GWT is also under continuous development, with the upcoming 1.6 version to include improvements like a faster hosted mode server, faster string handling, better compiler performance, and easier deployment to standard JEE WAR files. The Google team has done a very good job of providing a no-compromise framework that provides a fast, rich, and consistent user experience while keeping the developer focused on the application rather than the technology. It's definitely worth taking for a test drive.
Now, if I could just get that song out of my head...
If your server and website are not using HTTP Compression, you're not taking advantage of one of the easiest website performance features to implement. This blog tells you how to enable HTTP Compression in less than 10 minutes and reduce traffic by as much as 85%! The instructions below are a combination of articles I've read online and in print. We have implemented this on at least a dozen servers which host hundreds of websites with only one issue (mentioned below; issue with PDFs).
Create Compression Folder
- Create a folder where the compressed file will be cached. You can give it any name or leave the default: "%windir%\IIS Temporary Compressed Files".
- Grant write permissions to IUSR_{machinename} for the folder.
Enable Compression in IIS
- In IIS, right-click on the "Web Sites" node and click "Properties".
- Select the "Service" tab.
- Check "Compress application files". (we have seen issues where PDFs are compressed and cannot be opened)
- Check "Compress static files".
- Change "Temporary directory" (if you created your own folder).
- Set the "Maximum temporary directory size" to something that the hard drive can handle (e.g. 1024).
- Save and close the "Web Site" Properties.
Create a Web Service Extension (WSE)
- In IIS, select "Web Service Extensions".
- Add a new web service extension.
- Name it "HTTP Compression".
- Point it to "c:\windows\system32\inetsrv\gzip.dll".
- Check the "Set extension status to Allowed" to enable it.
Edit IIS Metabase
- In IIS, right-click on the server node (top level) and click "Properties".
- Check "Enable Direct Metabase Edit".
- In Notepad, open the metabase: C:\Windows\system32\inetsrv\metabase.xml
- Search for "<IIsCompressionScheme"
- There will be two of them, one for deflate and one for gzip.
- In "HcScriptFileExtensions", add aspx, asmx and any other extension that you need to the list already there. Do this for both deflate and gzip, and follow the format of the existing entries.
- Change "HcDynamicCompressionLevel" to 9. Do this for both deflate and gzip.
- Restart IIS
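If you want a feel for what compression buys you before touching a server, the effect is easy to reproduce locally. The Python sketch below gzips a made-up, repetitive HTML page and reports the saving. It does not talk to IIS at all, and the sample page is an assumption, so treat the numbers as an illustration rather than a prediction for your site.

```python
import gzip

# A made-up, repetitive HTML page standing in for a typical response.
rows = "".join(
    f"<tr><td>Item {i}</td><td>In stock</td></tr>\n" for i in range(500)
)
page = f"<html><body><table>\n{rows}</table></body></html>".encode("utf-8")

# Level 9 is the highest level Python's gzip module offers. The metabase
# setting above also uses 9, but the two scales are not necessarily the same.
compressed = gzip.compress(page, compresslevel=9)

saving = 1 - len(compressed) / len(page)
print(f"original: {len(page)} bytes, compressed: {len(compressed)} bytes")
print(f"saving: {saving:.0%}")

# The client decompresses transparently and gets the identical bytes back.
restored = gzip.decompress(compressed)
print(restored == page)
```

Markup-heavy pages compress well because tags repeat constantly; already-compressed formats such as images gain little, which is one reason to compress only selected file extensions.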
Migrating code to a Production environment is not a difficult task, but I have seen developers do some weird stuff. I have seen developers migrate code compiled for debug and then wonder why the site doesn't run very fast...not common, but it really does happen. I have seen developers migrate the entire source code for a project to the Production environment. They call it their "back up" location. Ever hear of VSS? These are bad practices that should be avoided.
Have you ever noticed that code you build for Release or Publish still generates a PDB file? Up until recently I really didn't think much of it, but one day I got curious. I did a little research and found a great blog on compiling options and the implications. In short, you can disable generation of the PDB file (see image below), but it's not recommended. Read this article for more details:
http://blog.vuscode.com/malovicn/archive/2007/08/05/releasing-the-build.aspx
From "ReadWriteWeb"...
Microsoft's next-generation web browser, Internet Explorer 8, has arrived. In a surprising move, after the demo of IE8 and its new features at today's session of the MIX08 conference, the startling announcement was made: "It's available for download now". The new browser showcases many new features and improvements, like Facebook and eBay integration, standards compliance, and the ability to work with AJAX web pages. What's most notable about IE8, though, is more than a sum of its parts. If anything, this launch shows that Microsoft is not taking Firefox's creep into browser market share lightly.
IE8 New Features Shown At MIX08
Standards Compliance
There were hints that IE8 would be a remarkable offering on the IE Blog as they released tidbits about the browser's capabilities. For example, the announcement of IE8's passing of the Acid2 test (a test for standards compliance) marked a milestone in IE8's development. The standards mode was originally going to be turned off by default letting web developers code for it by including a "meta" tag to make use of IE8's new standards compliant mode. Later, Microsoft came to their senses and made the default the standards-compliant mode. Meanwhile, Firefox also claims to have passed the Acid 2 test, but an open bug on bugzilla.mozilla.org seems to say otherwise. One commenter on the thread notes, "So, we essentially do pass the test. However, in some situations, it might still fail, that's why this bug is open."
Facebook Integration
With a Flock-like feature as an unexpected surprise, Microsoft capitalized on their partnership with the popular social networking site, Facebook, to allow IE8 users the ability to get status updates from Facebook right from their browser toolbar.
eBay Integration
Like Facebook, this feature also uses IE8's new technology, called "WebSlices", which introduces a new way to get updates from other sites via the browser itself, without having to visit the web site. With WebSlices, IE8 beta users can subscribe to portions of a page that update dynamically, in order to receive updates from that page as content changes. eBay will offer WebSlices, too, letting you track your auctions from the browser toolbar. Basically, WebSlices look like Favorites on your Links toolbar but they have a little arrow next to them - clicking on this arrow will show you a small window of live web content.
Live Maps Integration
Another WebSlice was integration with Live Maps. It appeared that you could even highlight text on a page, like an address, and then right-click and choose Live Maps from the context menu to get a WebSlice preview of that location on a map in a small pop-up window.
Integration with Me.dium
Me.dium integration will be supported in IE8 via WebSlices. Me.dium will now help web surfers discover and view WebSlices directly from the sidebar. The Me.dium sidebar will alert users to the presence of WebSlices on any page – and even allows users to read each WebSlice, without leaving the Sidebar. In addition, Me.dium will make real-time recommendations for other WebSlices on other relevant web pages and provides direct links to them based on the real time activity of other Me.dium users.
Working with AJAX Pages
IE8 will offer better functionality when it comes to AJAX web pages. The example showed a page where you could zoom in using AJAX technology. Previously, hitting the IE "Back" button would take you back to the last page you were on. Now, "Back" will zoom you out.
We can now find out what other features IE8 has to offer, since the beta is now publicly available for download. To get IE8, you can download it from here:
http://www.microsoft.com/windows/products/winfamily/ie/ie8/readiness/Install.htm.
I see these "Top 10 Ways..." articles all the time, but this one was the first in a while that didn't restate what all of the other ones talk about.
Website: http://www.sitepoint.com/article/aspnet-performance-tips
1. Determine what to optimize
Discusses quick, simple techniques such as tracing.
2. Decrease the size of the view state
This one really got my attention. ViewState is so powerful, but can kill your website too. With AJAX gaining so much momentum, ViewState compression is a must. The article even gives you the C# class! Storing ViewState on the server can also be a great technique.
3. Decrease the bandwidth that my site uses
HTTP Compression has been around for several years. With IIS 6.0, it requires no 3rd party controls or custom code. Why wouldn't you enable it today?
4. Improve the speed of my site
Output caching can improve site speed very quickly and easily. Nothing new here.
5. Refresh cache when the data changes
Depending on how you bind data to objects or store cache, this tip may or may not apply to you. But definitely worth reading.
6. Gain more control over the ASP.NET cache
Using the Cache class/object is a great technique, but only when it makes sense...do not overuse it.
7. Speed up database queries
We have helped many clients speed up their website throughout the years. Evaluating and optimizing the database is one of the easiest and best bang for your buck approaches. Just run down the list: Indexes, Stored Procs, Views, locking, etc.
8. Troubleshoot slow queries
A quick guide to understanding execution plans.
I came across an interesting set of principles that you might want to keep in mind the next time you set out to design an application, a website, or even improve your daily life. They are The Laws of Simplicity and were conceived by John Maeda, an artist and noted computer scientist from the MIT Media Lab. He compiled them in a short, 100-page book (and posted them on his website as well). I found them in a back issue of Wired magazine, in an article that applied them in a critique of some new gadget. I have since found that they increasingly influence my own analysis of UIs and websites, and I occasionally use them as the basis for discussions with clients to keep a design session on track.
The Laws are:
1. Reduce - The simplest way to achieve simplicity is through thoughtful reduction of functionality.
2. Organize - Organization makes a system of many appear fewer.
3. Time - Savings in time feel like simplicity.
4. Learn - Knowledge makes everything simpler.
5. Differences - Simplicity and complexity need each other.
6. Context - What lies in the periphery of simplicity is definitely not peripheral.
7. Emotion - More emotions are better than less.
8. Trust - In simplicity we trust.
9. Failure - Some things can never be made simple.
10. The One - Simplicity is about subtracting the obvious, and adding the meaningful.
You can find a more detailed explanation of each law on his site
www.petermorano.com
Within the past few days, Google has released a new feature to its Google Maps website. This new feature, Street View, is only available in a few cities like New York, Miami, Denver, San Francisco and Las Vegas. Not sure how it can be useful, but it is kind of cool. Just zoom into the street level and click on the blue outlined street. You will then see a photo view taken from a car.
View Las Vegas Street View
Some cool websites we've recently found...
kuler
http://kuler.adobe.com
"Explore, create and share color themes. Use it online or download themes to use
with Adobe CS2 and 3."
Basically, it's an interactive color theme designer. Download any of the 1,000+
themes which are rated by visitors, or create your own.
Simply Google
http://lloydi.com/blog/simplygoogleoriginal.htm
The owner of this website basically put most (if not all) Google features and websites
into one easy to use page.
SitePoint Contests
http://www.sitepoint.com/contests/
Over the past 10 years, we have tried all kinds of approaches to web design. Some
worked out, but most did not. Here are the pros and cons of what we have tried and
some comments on each approach:
- Professional designers
Pros: Original work. Direct communication with designer.
Cons: Expensive. Only get ideas from one designer, not a pool of designers. Depending
on the person, their availability might be limited.
Comments: Unless the customer specifically requires it, we will not use a designer
unless we know the person very well and he/she has consistently developed excellent
work and is good to work with. This means he/she needs to stick to the project schedule,
be available during core business hours and keep us up to date with progress.
- Template websites
Pros: Very cheap ($50-$70 for Flash, PSD, HTML & CSS). 1,000s of designs to choose
from. A lot of the designs are very professional.
Cons: Your website design will not be unique.
Comments: For customers who have a decent budget and need a unique design, this
is not an option. Templates work very well for businesses where the budget is tight
and the website design doesn't have to be unique. And realistically, what are the
chances of you seeing your website somewhere else...maybe 1 in a million? Also you
do get the PSD, so you can easily modify parts of it. We still use templates all
the time. When we choose a template, we buy it for the overall look-and-feel. Most,
if not all, of the images are swapped out with our own. And of course the logo is
replaced with ours.
- Crowdsourcing websites
Pros: Affordable (Prize amount + $25 flat fee or 10%). Unique designs. Designers
fight over prize money.
Cons: There's a chance only a few designers may participate in your contest and
you won't have anything good to choose from.
Comments: We have been using Design Outpost for a few months now (posted 6 projects).
For the most part, it has worked out well, until my most recent project received only
one entry the day before the project was supposed to end. We had to cancel it. Then we
found SitePoint. So far, it's been great, but we have only posted two projects...a logo
project that has received over 60 entries so far (ends tonight) and a template project
that has received only 2 entries (but doesn't end for 3 days).
A few months ago, I renamed my company from New Vision to iArchitect. With any business name change comes the need for a new logo, new stationery, a new website and new business cards. Fortunately, I have been doing this long enough to create great relationships with excellent designers and have also found great resources for one-off jobs here and there.
Today, I came across a great blog on "Cool Business Card Designs":
http://creativebits.org/cool_business_card_designs
I am not a very creative person, but the people who designed these business cards sure are. I thought I might share it with those of you who have businesses or are planning to start a new business. These kinds of touches can make a nice first impression.
I have been doing a lot of work with the Microsoft AJAX Toolkit lately.
Most recently, I have implemented it on a major insurance website I am always adding features to.
I am also updating the iArchitect CMS to be 100% AJAX-enabled. Very soon, you will see it on this website.
If you're not familiar with AJAX, here's the
Wikipedia definition.
Basically, it's the process of updating small portions of a webpage instead of refreshing the entire page. For example,
when you leave a comment to a blog on this website, instead of refreshing the entire page, I can now just refresh the comments area.
This presents a much more enjoyable user experience and speeds up the website.
When a portion of a webpage is being updated, you will see an animated image indicating so.
While working on these websites, I found the need to create custom animated images.
These are not easy to create...until now.
Visit the following link and you can create your own animated AJAX image in seconds!
http://www.ajaxload.info
A very interesting article on Google PageRank:
http://searchengineland.com/070426-011828.php
1. Tight writing. That doesn't mean bad or easy writing.
2. Copy of about 600-800 words is better for SEO and catching the long tail of search.
3. Title – Subject – Support, in that order, like subject, verb, object.
4. Titles should be snappy and informative – clickable, but clear.
5. Leads (first sentence or paragraph) should get to the point. Tell the reader what the article's about first thing.
6. No fancy, wordy intros where it's not clear what you're talking about.
7. Information beats fluff every time. Pretty is for books and newspapers (and only sometimes).
8. Information does not beat style every time. Style keeps people awake.
9. Sans serif fonts are easier and faster to read on computer screens.
10. White space is awesome – even better than big, pretty pictures.
11. Content should be scannable.
12. Think in bullets and subtitles.
13. People like lists.
14. Pictures should be specific and informative, not generic, decorative and ad-like.
15. Photos should be relevant to content.
16. People in pictures should look friendly and approachable (and have their whole head).
17. Photos should be full body if possible.
18. Spell stuff right. It makes you look smarter.
19. Grammar IS important. Unless you're not really a professional.
20. Online press releases should be even tighter than Web copy.
Source: WebProNews
Version 4.1 Released with all of the features you asked for and much more! Generate Google, Yahoo and HTML Sitemaps....and now RSS Feeds!
Version 4.0 was released only one and a half weeks ago, but I've received a lot of great response.
Because of all of the great feature requests, I decided to jump on them right away and get them out in a v4.1.
Download it today and check it out.
Download - Try it today for FREE!
Download Sitemap Generator v4.1
New Features in v4.1:
- Preferences: Ability to specify the spider start directory.
- Preferences: Ability to disable advertising
- Preferences: Last Modified Date has been overhauled! Instead of just specifying a default date, your options are:
"Use server's modified date" - While spidering, the application will read the webpage headers to get the last modified date (if available). This is the best option.
"Use today's date" - Especially useful when your server doesn't supply the last modified date and you want to schedule your updates.
"Let me specify the date" - Some people might still want to specify a default date.
- Ability to export to RSS: Another format to publish your website! RSS is very popular and getting more attention every day.
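For the curious, the three Last Modified options boil down to a small decision. The Python sketch below is not the Sitemap Generator's code: the `resolve_last_modified` function, the mode names and the fallback to today's date when the server sends no header are all assumptions made for illustration.

```python
from datetime import date
from email.utils import parsedate_to_datetime
from typing import Optional


def resolve_last_modified(mode: str,
                          header: Optional[str] = None,
                          specified: Optional[date] = None,
                          today: Optional[date] = None) -> date:
    """Pick a last-modified date according to the chosen preference.

    mode is one of "server", "today" or "specified" (hypothetical names).
    """
    today = today or date.today()
    if mode == "server":
        if header:
            # Parse an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
            return parsedate_to_datetime(header).date()
        return today  # assumed fallback when the server sends no header
    if mode == "today":
        return today
    if mode == "specified":
        if specified is None:
            raise ValueError("a date must be supplied for mode 'specified'")
        return specified
    raise ValueError(f"unknown mode: {mode}")


from_header = resolve_last_modified(
    "server", header="Wed, 21 Oct 2015 07:28:00 GMT"
)
print(from_header)
```

The server option is preferable whenever the header is present because it reflects when the page really changed, rather than when the sitemap happened to be generated.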
A friend of mine came across a problem trying to stream a FLV video on his customer's website. Before deploying it to their web server, he tested it successfully on his local machine and on a UNIX test server...no problems. Once he migrated the code and FLV file to the Production Windows 2003 Server, it didn't work anymore. He figured out the problem and told me about it.
Issue
When Flash Player movie files that stream external FLV files (Flash videos) are placed on a Microsoft Windows 2003 server and then viewed in a browser, the SWF file plays correctly, but the FLV video does not stream. These files work correctly if tested on other operating systems. The issue affects all FLV files played via Windows 2003 server, including files made with the Flash Video Kit for Dreamweaver MX 2004.
This TechNote describes the steps necessary to allow Windows 2003 to stream Flash Video files.
Note: These instructions are provided as a courtesy for customers and address the issue in Microsoft Internet Information Services (IIS) 6.0 rather than Flash.
Reason
With IIS 6.0, Microsoft changed the way streaming media is handled. Previous versions of IIS did not require any modification to stream Flash Video. Microsoft IIS 6.0, the default web server that ships with Windows 2003, requires a MIME type to recognize that FLV files are streamed media.
Solution
Please be aware that these steps do not resolve any issue with Flash, but are a configuration step for Microsoft Windows 2003 and Microsoft IIS Server 6.0. Any difficulties in executing these instructions or any errors that may arise from modifying your system settings should be addressed to Microsoft. For more details, please refer to your IIS documentation.
1. On the Windows 2003 server, open the Internet Information Services Manager.
2. Expand the Local Computer Server.
3. Right-click the local computer server and select Properties.
4. Select the MIME Types tab.
5. Click New and enter the following information:
Associated Extension box: .FLV
MIME Type box: flv-application/octet-stream
6. Click OK.
7. Restart the World Wide Web Publishing service.
Source
http://www.adobe.com/cfusion/knowledgebase/index.cfm?id=tn_19439
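What the steps above do is teach the web server which MIME type belongs to the .flv extension. The same kind of mapping can be tried out with Python's standard mimetypes module. This sketch only illustrates the lookup; it does not configure IIS, and the file path in it is made up.

```python
import mimetypes

# A private registry so the example does not alter process-wide settings.
registry = mimetypes.MimeTypes()

# Register the extension with the MIME type quoted in the steps above.
registry.add_type("flv-application/octet-stream", ".flv")

# Look up the type a server using this registry would send for the file.
mime_type, encoding = registry.guess_type("videos/intro.flv")
print(mime_type)
```

IIS 6.0 refuses to serve extensions it has no MIME mapping for, so until the extension is registered the player's request for the FLV file simply fails.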
Announced this morning on Google.com...
As you may know, for every $1 you spend on AdWords, you can process $10 of Google Checkout sales for free. Just in time for the holidays, we're giving you even more by processing your Google Checkout sales for free through the end of 2007! Here's how it works:
- From November 8, 2006 through December 31, 2007, we'll process your Checkout transactions for free, even if you aren't an AdWords advertiser. If you're already an AdWords advertiser, we'll process your Checkout transactions for free regardless of what you spend on AdWords.*
- Valid Checkout orders you receive during the promotion will automatically qualify.
- You can take full advantage of this promotion by encouraging your buyers to use Google Checkout on your site.
- Other applicable fees (e.g. chargeback fees) may apply. This promotion is subject to the Google Checkout Terms of Service. Google may revoke the promotion for accounts that do not comply with these terms.
On January 1, 2008, the standard transaction fee will apply again. Also, if applicable, your regular free transaction processing (based on your December, 2007 AdWords spend) will resume.
Using Google Checkout to increase sales and lower costs during this busy holiday season has never been easier. If we can do anything else to help, feel free to drop us a line. Happy holidays from Google Checkout!
* AdWords advertisers: Because this promotion begins on 11/8/06, the free transaction processing based on your AdWords spend will still apply to your Checkout sales from 11/1/06 through 11/7/06. Any Checkout orders you receive and process from 11/8/06 through 12/31/07 will then be eligible for free processing under this promotion.
From a WebProNews email I received recently:
Important SEO Tips Everyone Should Know
Chris Richardson | Staff Writer
During the Interactive Site Review session at the 2007 Las Vegas PubCon, various site owners submitted their site for the panel to pick apart. The panel consisted of heavy hitters like Matt Cutts from Google, Tim Mayer from Yahoo, Greg Boser from WebGuerrilla, and Danny Sullivan.
While some of the sites they reviewed may have been lacking in certain departments, the knowledge the panel bestowed is quite valuable for SEOers of all types. What follows are some quotes and paraphrases that go a long way to demonstrating what it is search engines are looking for:
- each page of your site is an entry point, optimize (title tags, keyphrases) for what each page targets - Greg
- strive for quality links over quantity links - the entire panel
- if you are targeting your site geographically, get links from local entities (Chamber of Commerce, local directories)
- unique content is important (this and link bait are the prevailing themes of the Las Vegas PubCon)
- if you can get into the top 3 of Google Local, you will be on the front page of Google's standard search if the query is geographically based... - Danny Sullivan
- when optimizing for Google Local, navigate to the Google Local Business Center - this was suggested by Matt as a source to assist with being indexed by Google Local's index as well as a place to claim your business, similar to Technorati's blog claim function.
- one of the sites reviewed was a real estate site... during this portion, Greg revealed some interesting information about how this industry markets to the search industry: the real estate industry conducts SEO much like they did in 98, it's a bad field in reference to SEO...and while this may not be a tip per se, it's good information to be aware of especially if you are considering this industry...
- ditch javascript menus altogether... they are a red flag to ranking algos - Boser
- template-based sites may not rank well because they appear alike to the crawlers... - Tim Mayer and Matt Cutts both iterated this thought.
- Session IDs urls need to be blocked from crawlers because of duplicate content issues (don't serve session ids to bots)... this was emphasized by Matt who said: "session ids can be poison for crawlers"
- if your site sells manufactured products, don't use manufacturer copy... use your own descriptions - this was also stated by Heather Lloyd-Martin during the effective web copy session
- there's no good use for 302 redirects, ever - the entire panel
- blog about your product or target area, this provides so much of the original content the search engines are looking for - paraphrased from Matt Cutts
This last point plays into the whole link bait theory that was incredibly prominent during this conference (I cannot count the number of times I heard this phrase...). Keep these tips and ideas in mind when you are conducting any SEO or SEM-related process. They will serve you well.
Learn step by step how to create a web layout with Adobe Photoshop.
Really amazing and useful for web design beginners.
http://www.13dots.com/index.php?categoryid=33&p2_articleid=65
Over the past few years, I have worked with companies that have used offshore resources to work on part of or complete projects. I also have been "hands on" with some of the projects and in my experience, it rarely works.
I was recently in a meeting discussing the possibility of offshoring some work. Since I have been down this road all too many times, I voiced my opinion. Most people also didn't like the idea and I took it upon myself to gather some facts. Now I know there are arguments for both sides, but when I was browsing the web I came across a study by Gartner Inc.
If you're not familiar with Gartner Inc., they are pretty much the authority on gathering, analyzing and reporting statistical data in an unbiased manner. Companies all over the world utilize Gartner's services to gather data on pretty much anything...and because they are so thorough, it's not cheap.
The 5 high level reasons are:
1. Unrealized cost savings
2. Loss of productivity
3. Poor commitment and communications
4. Cultural differences
5. Lack of offshore expertise and readiness
Click here to read the entire article
I found this great article on the "18 Mistakes That Kill Startups", by Paul Graham.
I have been a part of several start ups, some successful, some not so successful.
A lot of the points Paul makes I have seen first hand.
I learned a few really good things from his article.
http://paulgraham.com/startupmistakes.html
A few days ago, I blogged about how to find database credentials using Google Code Search.
Here's another interesting search:
Google Code Search supports regular expressions...try:
^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
21,600 results, which is probably at least 30,000 email addresses.
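If the pattern looks cryptic, it reads as: one or more word characters, optionally followed by groups starting with "-", "+" or ".", then "@", then a domain built the same way that contains at least one dot. Because of the ^ and $ anchors it only matches when the address is the entire string. Here is a small Python sketch that tries it on a few made-up strings; the `looks_like_email` helper is just an illustrative name.

```python
import re

# The pattern from the post, anchored to match one whole string.
EMAIL_PATTERN = re.compile(r"^\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$")


def looks_like_email(text: str) -> bool:
    """Return True when the whole string fits the pattern."""
    return EMAIL_PATTERN.match(text) is not None


# Made-up sample strings, not real addresses.
samples = ["jane.doe+news@example.com", "jane@example", "not an email"]
for sample in samples:
    print(sample, looks_like_email(sample))
```

The second sample fails because the pattern insists on a dot somewhere after the "@".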
I was messing around with Google Code Search today and decided to see if people were dumb enough to publish web.config files with their private database credentials.
Yep...there are a few out there. Check out this example:
http://google.com/codesearch?hl=en&lr=&q=database+file%3A%22web.config%22&btnG=Search
Today, Google announced "Google Code Search":
http://www.google.com/codesearch
"It's a site that simplifies how software developers search for programming code to improve existing software or create new programs."
The Basic Search is OK, but you really need to check out the Advanced Search.
Here you can specify a few things, most importantly the language.
In the example below, I search for "PayPal" in C# code.

Results return in code snippet format. Check out result #11...that's some amazing code!

Click into each result and the entire code file is displayed with links to related project files.
A few months ago, I started receiving the following exception on my website for one of my blogs:
System.Web.HttpException: The state information is invalid for this page and might be corrupted. ---> System.Web.UI.ViewStateException: Invalid viewstate
Immediately, I visited the webpage and it displayed fine. So I refreshed a bunch of times without receiving any errors. Then I tried viewing and posting comments with several browsers...nothing wrong. Then I opened up the code and reviewed it...it looked good. Since it was only happening to one blog, I thought it might have to do with the content of that blog...maybe I accidentally dropped a "<form>" or "<__VIEWSTATE>" tag in the content...nope! I was puzzled and kind of dismissed it for awhile.
Then it started happening more and more...on average, I now receive 2-3 of these errors per day. I started getting a little worried that a handful of visitors couldn't post comments to my website...but I had no idea what to do.
Then I got a call from a customer. They purchased some blog spamming software and asked me to come to their office, figure out how it works and show them how to use it. Before going there, I researched the product and that's when I figured out my problem. The software I was reading up on asks you what keywords to search blog sites for. Next, it asks you for a blog comment. Finally, it searches the blog sites with the keywords you entered and posts the comment you entered. Not only does it get people to read your comment and visit your website, but it also increases your Google "PageRank" over time...genius! But be careful...if Google sniffs this out, you'll be blacklisted.
This is what is happening to my website...people are running software programs to leave spam comments and are causing exceptions to be thrown. The good news is that I now know it's not my software causing the problems and that I'm successfully blocking a lot of spam thanks to my Captcha code.
Below is the full exception message...I'm posting it so that it is indexed in search engines and others having this problem will now have a resolution.
Page: Global.asax
Method: Application_Error
Exception: System.Web.HttpException: The state information is invalid for this page and might be corrupted. ---> System.Web.UI.ViewStateException: Invalid viewstate.
Additional Details:
Client IP: 65.88.129.2
Port: 46644
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
ViewState: sssssss
Referer:
Path: /BlogEntry.aspx
---> System.FormatException: Invalid length for a Base-64 char array.
   at System.Convert.FromBase64String(String s)
   at System.Web.UI.ObjectStateFormatter.Deserialize(String inputString)
   at System.Web.UI.ObjectStateFormatter.System.Web.UI.IStateFormatter.Deserialize(String serializedState)
   at System.Web.UI.Util.DeserializeWithAssert(IStateFormatter formatter, String serializedState)
   at System.Web.UI.HiddenFieldPageStatePersister.Load()
   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---
   at System.Web.UI.ViewStateException.ThrowError(Exception inner, String persistedState, String errorPageMessage, Boolean macValidationError)
   at System.Web.UI.HiddenFieldPageStatePersister.Load()
   at System.Web.UI.Page.LoadPageStateFromPersistenceMedium()
   at System.Web.UI.Page.LoadAllState()
   at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
   at System.Web.UI.Page.ProcessRequest(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
   at System.Web.UI.Page.ProcessRequest()
   at System.Web.UI.Page.ProcessRequestWithNoAssert(HttpContext context)
   at System.Web.UI.Page.ProcessRequest(HttpContext context)
   at ASP.blogentry_aspx.ProcessRequest(HttpContext context) in c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET Files\root\6486194d\28a90c6b\App_Web_znedlsbn.2.cs:line 0
   at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
   at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
Inner Exc: System.Web.UI.ViewStateException: Invalid viewstate.
Client IP: 65.88.129.2
Port: 46644
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; .NET CLR 1.1.4322)
ViewState: sssssss
Referer:
Path: /BlogEntry.aspx
---> System.FormatException: Invalid length for a Base-64 char array.
   at System.Convert.FromBase64String(String s)
   at System.Web.UI.ObjectStateFormatter.Deserialize(String inputString)
   at System.Web.UI.ObjectStateFormatter.System.Web.UI.IStateFormatter.Deserialize(String serializedState)
   at System.Web.UI.Util.DeserializeWithAssert(IStateFormatter formatter, String serializedState)
   at System.Web.UI.HiddenFieldPageStatePersister.Load()
   --- End of inner exception stack trace ---
Inner/Inner Exc: System.FormatException: Invalid length for a Base-64 char array.
   at System.Convert.FromBase64String(String s)
   at System.Web.UI.ObjectStateFormatter.Deserialize(String inputString)
   at System.Web.UI.ObjectStateFormatter.System.Web.UI.IStateFormatter.Deserialize(String serializedState)
   at System.Web.UI.Util.DeserializeWithAssert(IStateFormatter formatter, String serializedState)
   at System.Web.UI.HiddenFieldPageStatePersister.Load()
I recently had a requirement from a customer to generate a PDF that looked exactly like the webpage. Instead of trying to recreate the webpage in Crystal or SQL Server Reports, I decided it would be much easier, cheaper and more maintainable to simply take the webpage's HTML, load it into a third-party PDF generator and create a PDF.
Below is the code you need to do this.
The important part is that you need to override the page's "OnPreRenderComplete" method and capture the page's HTML by calling base.Render.
//These using directives belong at the top of the code-behind file
using System;
using System.IO;
using System.Text;
using System.Web.UI;

protected override void OnPreRenderComplete(EventArgs e)
{
    //Call the base first so PreRenderComplete event handlers still run
    base.OnPreRenderComplete(e);
    GeneratePDFFromPageHTML();
}

protected void GeneratePDFFromPageHTML()
{
    try
    {
        //Get the current page's HTML. The using blocks close and
        //dispose both writers, even if rendering throws.
        using (StringWriter sw = new StringWriter())
        using (HtmlTextWriter htmltw = new HtmlTextWriter(sw))
        {
            base.Render(htmltw);
            StringBuilder html = sw.GetStringBuilder();

            //Generate PDF with HTML here. Code not supplied
            //since there are so many 3rd party PDF generators.
            //When done generating PDF, either load it into the
            //browser or stream it back as an attachment
        }
    }
    catch (Exception ex)
    {
        //Handle exception
    }
}
Regular expressions can be very helpful when searching for strings of text.
I use them all the time and recently came across an amazing website:
http://regexlib.com - Regular Expression Library
Some of the pages I use a lot:
Cheat Sheet: http://regexlib.com/CheatSheet.aspx
Tester: http://regexlib.com/RETester.aspx
Over the past 10 years, I have had the pleasure to work with some great customers. I also have had the unfortunate honor of working with some "difficult" customers. Below are a few things I have learned and my advice to you if you're looking to build or re-build your company website. Good luck!
1. Do Your Homework
The number one complaint from designers and developers is that customers cannot articulate what they want. Designers and developers are not mind readers and often are not very familiar with your industry. You need to specify exactly what features you want, in great detail. Visit competitor sites and see what they have. It's OK to copy ideas, but make them unique in your own way if possible. The design process is very important and takes time. There will be several iterations until a finished design is made. But if you have a clear vision, you can save time and money because the designers won't have to keep reworking the design while you figure out what you want.
2. Don't Cut Corners
There are several phases to any website project. A typical life cycle is:
1. Project Inception - Imagine how great the site could be if it were created without limits. Write these ideas down.
2. Project Launch Meeting - Brainstorming session. No decisions will be made, but you'll talk about things like colors, fonts, logos, navigation and who the site will target (very important!).
3. Project Requirements Document - Essentially you want this document to summarize what the team currently thinks the project will look like when it's completed. There's no need to get it all right or feel tied down.
4. Discovery Documents - With these documents, you'll get down to working out who your target audience is, what they want, the sections of the site, and what they will contain. This will become the true roadmap for the project.
5. Prototype - The prototype will turn your documents into a reality. Now is the time to simply produce a visual that the client can see.
6. Development - The designers will give the developers the necessary files and developers will build the website. After some time, a rough website will be produced and give everyone a chance to see where the project now stands...and appreciate how far it has come.
7. Testing and Final Approval - Get lots of people to use your site and make sure they enjoy it. Take their comments into consideration and tweak where appropriate.
8. Go Live - Turn the site loose on the world. Monitor the usage, watch for errors and fix them. Get people's reactions.
Maintenance will be necessary. A lot of people think that once the site is live, it's done. Not true at all. I'll discuss this more later (#7 below).
3. You Need To Provide Content
Once you have your Discovery documents complete, the designers and developers will start working. This is when you need to start building the content for the website. Start early and stay on top of it! Writing content for most people is difficult and takes a lot of time. It's very easy to push it off until the end...this will severely push back your timeline. Also, some customers think the web designers or developers write the content. This doesn't make any sense. It's your business, you're the expert. You have to write the content.
4. Too Flashy = No Traffic
Web sites don't need to be over-the-top to get results. Some clients will ask for lots of animation and cool graphics. They want a splash page built in Flash, and JavaScript rollover images for menu items. They don't understand that (a) these things can turn off visitors who don't have a fast Internet connection or up-to-date browsers or extensions; (b) these things eat up bandwidth; (c) these things make a web site much less visible to search engines. In short, they rarely help, and often hurt, the majority of business web sites. The structure should be defined by the content, not the look. The look is important but only to support the content...not to control the content. A lot of designers are user interface experts - listen to them.
5. It Costs More And Takes Longer than You Think
Just because you can create a website in a few minutes with an off-the-shelf package doesn't mean it's the right solution for your business. These off-the-shelf packages are bought by thousands of people and your website will look like a cheesy, canned solution. They also only have a limited feature set. If the feature you need isn't there, you will not get it and no developer will be able to add it in.
It's not that the website is hard to build, it's just that it takes time. Good, experienced developers who have built hundreds of websites over that time typically create a library of common code. These developers can help you save a lot of money by reusing their common code...but if you have something unique it needs to be created from scratch. Just remember that you only get what you pay for. If you get something cheap, there is always a catch. The lowest bidder is the lowest bidder for a reason. Also, most people in the Web industry are clueless. The majority of web developers need to update their skills to what is required for the 21st century. Anyone can build a website...but it doesn’t always mean that they should.
6. If You Build It, They Won't Necessarily Come
There's a commercial where a company launches their website and immediately starts getting millions of sales. That doesn't happen. Your website will not be the first result in any Google search the first day you go live. Getting your website on the first page of Google takes time (it'll take months), commitment and money. Building the site is not the same thing as marketing it. There are hundreds of companies whose sole purpose is to help you market your website. It's not "If we build it, they will come, and throw money at us." For most businesses, it's more like "If we build it, and dedicate effort to keeping it fresh and up to date and interesting, and if we're selling something people really want to buy, and if we think of the web site as only a part of our marketing effort, and if we pay attention to having clean code and optimized pages and tweak our pay-per-click keywords effectively, we should be more successful with a website than without."
7. Maintenance Is A Must
I know a lot of developers who do not create maintenance sites for their customers. Whenever the customer needs an update, they have to pay for the developer's services. This nickel-and-diming approach should be avoided. When creating your Project Requirements document, be sure a secure administrative section of your website is developed. Every reasonable feature of your website must be maintainable...meaning products and content, but not the design and most look-and-feel. You also need to have the ability to move your website from one provider to the next without having to contact a developer. All web sites require at least some degree of maintenance. Try to control that maintenance yourself. After a year or two, major changes (e.g. a complete website redesign) may be needed.
8. Designers And Developers Are Professionals
Many web designers and developers are appalled at assumptions that their skills are basic and valueless. Professional design takes a lot of work, skill, education and ability. Do you think that a professional chef is a person who puts a bunch of ingredients that don't match into a big pan and sticks it in the oven? You should expect to assign one person from your company to interact with the designers and developers. Nothing will cause more confusion, anger and disappointment than one designer having 3 or 4 people calling and e-mailing them every 5 minutes...each person changing what the last person said.
Conclusion
I believe these points are very important. As a developer, I try to have a discussion about these points with all new customers. Almost all "get it" and agree...and it's a pleasure to do business with these people. In fact, if I know they will work this way, I know the process will be faster and easier and, therefore, much more affordable for them. If I know the customer is going to do things the hard way, I will add in the necessary hours to compensate for the extra time needed to complete their project.
Bonus (for Developers): From My Experience...
Over the past 10 years, I have built hundreds of websites and still learn new things every day. Here are some of my practices that allow me to work very fast and efficiently:
Design: I used to use graphic artists, but now use templates whenever possible. Why? A custom design costs between $1,000 and $4,000. Professional templates are about $60. Sure, other people can buy these templates too, but for most small-medium businesses, who cares? It's worth saving the thousands of dollars. Also, you get all of the source files, so you can tweak them to make them your own. All you'd have in common with others is the colors/style...it's your own logo and content...change the graphics. My favorite template site: http://www.templatemonster.com
Development: I write all code in .NET, C# specifically. The .NET 1.0 platform came out in 2001 and has developed into the most amazing development platform. The set of libraries and namespaces is so large and includes everything you could possibly want to do. .NET is by far the fastest development platform and is the future of software. (Bold, but true statement) When appropriate, use the Microsoft Application Blocks... I use several of them in most of my projects.
Development: Build your own library of reusable code. Over the past few years, I have continued to extend my "BrianPautsch" namespace. It contains distinct classes and methods for specific tasks including exception handling, validation, emailing, form helpers, database access, ecommerce, SSL, image handling, captcha, advanced search engines, etc. When I start on a new project, I can complete more than 1/2 the project requirements in less than an hour because of my library of code.
Development: Use a code generator. There are a bunch out there, so you don't have to create your own. I created my own a few years ago and use it to generate parts of the Business layer, the Database layer and all related Stored Procedures. Once I have the database schema completed, I can generate code and stored procedures in seconds with just one click. Not only am I saving hours and hours of work (extremely boring work!), but the code is 100% perfect and ready to be used. Now, with my library of code and the code generator, I can have a website 75% complete in only an hour or so. This is how I can create a fully functional ecommerce site in a weekend.
Search Engines: Create Google Sitemaps with my Google Sitemap Generator (shameless plug). I have seen websites get into the Google index within 24 hours this way. I have also seen sites already in Google jump in the results by simply creating a Google Sitemap.
Back on 3/9/2005, I published a blog post titled Website Testing - Automation, Autofill, etc. (C# WinForms).
Since then, dozens of people have downloaded it and used it successfully to build projects with similar technology.
Today, "Herman" asked how to populate forms, click buttons, etc. in frames. Of course, frames are a big no-no, but they do exist.
This blog explains how to do what Herman is asking for.
Download code
The only real difference from the previous blog is lines 5-29. In lines 5-14, I iterate through the frames collection and find the specific frame by name. On line 20, I load that frame's document into its own IHTMLDocument2 object. And finally, on lines 22-28, I find the controls and manipulate them. That's it!
 1  //Get the web browser document
 2  myDoc = new HTMLDocumentClass();
 3  myDoc = (HTMLDocument) WebBrowser.Document;
 4
 5  //Find the frame named "content"
 6  FramesCollection myFramesColl = myDoc.frames;
 7  IHTMLWindow2 myContentFrame = null;
 8  for (int i = 0; i < myFramesColl.length; i++)
 9  {
10      object refIndex = i;
11      mshtml.IHTMLWindow2 frame = (mshtml.IHTMLWindow2)myFramesColl.item(ref refIndex);
12      if (frame.name == "content")
13          myContentFrame = frame;
14  }
15
16  //Frame found?
17  if (myContentFrame != null)
18  {
19      //Load into IHTMLDocument2 object
20      IHTMLDocument2 myContentFrameDoc = myContentFrame.document;
21
22      //Find the textbox
23      HTMLInputElement objTextBox = (HTMLInputElement)
            myContentFrameDoc.all.item("txtEmail", 0);
24      objTextBox.value = txtEmail.Text;
25
26      //Click the selected button
27      HTMLInputElement btnSearch = (HTMLInputElement)
            myContentFrameDoc.all.item("cmdJoin", 0);
28      btnSearch.click();
29  }
From the Stanford archives...
The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305
Abstract
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.
Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google
1. Introduction
(Note: There are two versions of this paper -- a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! or with search engines. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100 and fits well with our goal of building very large-scale search engines.
1.1 Web Search Engines -- Scaling Up: 1994 - 2000
Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94] had an index of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.
1.2 Google: Scaling with the Web
Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.
These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access (see section 4.2). Further, we expect that the cost to index and store text or HTML will eventually decline relative to the amount that will be available (see Appendix B). This will result in favorable scaling properties for centralized systems like Google.
1.3 Design Goals
1.3.1 Improved Search Quality
Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 -- Navigators, "The best navigation service should make it easy to find almost anything on the Web (once all the data is entered)." However, the Web of 1997 is quite different. Anyone who has used a search engine recently, can readily testify that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results). One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say in the top tens of results). Indeed, we want our notion of "relevant" to only include the very best documents since there may be tens of thousands of slightly relevant documents. This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return). There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).
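As a rough illustration of the precision/recall trade-off described above, here is a minimal Python sketch. The function names, the document IDs and the relevance judgments are all hypothetical, and precision is computed as a fraction of the top k results (the usual normalized form of the count the text mentions); none of this is taken from the Google system itself.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k returned documents that are relevant."""
    top_k = ranked_results[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / len(top_k)


def recall(ranked_results, relevant):
    """Fraction of all relevant documents that the system returned at all."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in set(ranked_results) if doc in relevant)
    return hits / len(relevant)


# Hypothetical ranked result list and relevance judgments.
results = ["d1", "d7", "d3", "d9", "d4"]
relevant_docs = {"d1", "d3", "d5", "d6"}

p_at_3 = precision_at_k(results, relevant_docs, 3)  # 2 of the top 3 are relevant
r = recall(results, relevant_docs)                  # 2 of the 4 relevant docs were returned
```

A user who only reads the first few results is served by a high `p_at_3`, even when `r` stays low because many relevant documents were never returned.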
1.3.2 Academic Search Engine Research
Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains. This number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented (see Appendix A). With Google, we have a strong goal to push more development and understanding into the academic realm.
Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable.
Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, there have already been several papers using databases generated by Google, and many others are underway. Another goal we have is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.
2. System Features
The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link text to improve search results.
2.1 PageRank: Bringing Order to the Web
The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal.
2.1.1 Description of PageRank Calculation
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.
PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.
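The "simple iterative algorithm" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pagerank`, the three-page `toy_web` graph and the fixed iteration count are hypothetical, every page is assumed to have at least one outgoing link, and nothing here reflects how Google's production code is written.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to. Every page is
    assumed to have at least one outgoing link (no dangling pages).
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links here contributes PR(T) / C(T).
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
total = sum(ranks.values())
normalized = {page: score / total for page, score in ranks.items()}
```

With the formula exactly as written, the scores on this toy graph sum to the number of pages (3) rather than to one; dividing by the total, as `normalized` does, gives values that form a probability distribution. Page C, which is linked from both A and B, ends up ranked above B, which only receives half of A's weight.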
2.1.2 Intuitive Justification
PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the "random surfer" will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank, again see [Page 98].
Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.
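The random surfer model can be simulated directly. The Python sketch below is an illustration only: the function name, the toy graph, the step count and the fixed seed are hypothetical. It also assumes the surfer follows a link with probability d and jumps to a random page with probability 1 - d, which is the reading that matches the (1-d) term in the PageRank formula.

```python
import random


def simulate_random_surfer(links, d=0.85, steps=200_000, seed=0):
    """Estimate how often a random surfer lands on each page.

    Assumption: with probability d the surfer follows a random outgoing
    link; otherwise the surfer gets bored and jumps to a random page.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pages = list(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < d:
            current = rng.choice(links[current])  # click a link
        else:
            current = rng.choice(pages)           # start over somewhere else
    return {page: count / steps for page, count in visits.items()}


# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
visit_share = simulate_random_surfer(toy_web)
```

On this toy graph the visit shares should settle close to the PageRank values obtained from the formula once those are scaled to sum to one, with B visited least because it is only reachable through half of A's links.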
2.2 Anchor Text
The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information, and expands the search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.
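A minimal sketch of anchor propagation, assuming a tiny in-memory index: each word of a link's text is credited both to the page the link is on and to the page it points to. The function name, the data layout and the example.com URLs are hypothetical and far simpler than the structures a real crawl of millions of pages needs.

```python
from collections import defaultdict


def index_anchor_text(pages):
    """Build a word -> set-of-URLs index from link text.

    pages maps a crawled URL to a list of (anchor_text, target_url) pairs.
    Each anchor word is credited to the source page and to the target page.
    """
    index = defaultdict(set)
    for source_url, anchors in pages.items():
        for anchor_text, target_url in anchors:
            for word in anchor_text.lower().split():
                index[word].add(source_url)  # what most engines do
                index[word].add(target_url)  # anchor propagation
    return index


# Hypothetical crawl: two pages were fetched; the image was never crawled.
crawled = {
    "http://example.com/a": [("Stanford campus map", "http://example.com/map.gif")],
    "http://example.com/b": [("campus photos", "http://example.com/photos")],
}
anchor_index = index_anchor_text(crawled)
```

Here a query for "map" can return `http://example.com/map.gif` even though it is an image and was never fetched, which is both the benefit and the validity risk that the text describes.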
2.3 Other Features
Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Third, full raw HTML of pages is available in a repository.
3 Related Work
Search research on the web has a short and concise history. The World Wide Web Worm (WWWW) [McBryan 94] was one of the first web search engines. It was subsequently followed by several other academic search engines, many of which are now public companies. Compared to the growth of the Web and the importance of search engines there are precious few documents about recent search engines [Pinkerton 94]. According to Michael Mauldin (chief scientist, Lycos Inc) [Mauldin], "the various services (including Lycos) closely guard the details of these databases". However, there has been a fair amount of work on specific features of search engines. Especially well represented is work which can get results by post-processing the results of existing commercial search engines, or produce small scale "individualized" search engines. Finally, there has been a lot of research on information retrieval systems, especially on well controlled collections. In the next two sections, we discuss some areas where this research needs to be extended to work better on the web.
3.1 Information Retrieval
Work in information retrieval systems goes back many years and is well developed [Witten 94]. However, most of the research on information retrieval systems is on small well controlled homogeneous collections such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference [TREC 96], uses a fairly small, well controlled collection for their benchmarks. The "Very Large Corpus" benchmark is only 20GB compared to the 147GB from our crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the web. For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words. For example, we have seen a major search engine return a page containing only "Bill Clinton Sucks" and a picture from a "Bill Clinton" query. Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. If a user issues a query like "Bill Clinton" they should get reasonable results since there is an enormous amount of high quality information available on this topic. Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.
3.2 Differences Between the Web and Well Controlled Collections
The web is a vast collection of completely uncontrolled heterogeneous documents. Documents on the web have extreme variation internal to the documents, and also in the external meta information that might be available. For example, documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output from a database). On the other hand, we define external meta information as information that can be inferred about a document, but is not contained within it. Examples of external meta information include things like reputation of the source, update frequency, quality, popularity or usage, and citations. Not only are the possible sources of external meta information varied, but the things that are being measured vary many orders of magnitude as well. For example, compare the usage information from a major homepage, like Yahoo's which currently receives millions of page views every day with an obscure historical article which might receive one view every ten years. Clearly, these two items must be treated very differently by a search engine.
Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic, and companies which deliberately manipulate search engines for profit become a serious problem. This problem has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit.
4 System Anatomy
First, we will provide a high level discussion of the architecture. Then, there are some in-depth descriptions of important data structures. Finally, the major applications (crawling, indexing, and searching) will be examined in depth.
Figure 1. High Level Google Architecture
4.1 Google Architecture Overview
In this section, we will give a high level overview of how the whole system works as pictured in Figure 1. Further sections will discuss the applications and data structures not mentioned in this section. Most of Google is implemented in C or C++ for efficiency and can run in either Solaris or Linux.
In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
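The URLresolver step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the original implementation: the function name `resolve_anchors`, the tuple layout of the anchors file, and the in-memory `url_to_docid` mapping are all hypothetical, and links whose endpoints have no docID are simply skipped here.

```python
from urllib.parse import urljoin


def resolve_anchors(anchors, url_to_docid):
    """Turn (source_url, href, anchor_text) records into docID-based data.

    Hypothetical sketch: the record layout and the dict-based URL -> docID
    mapping are assumptions made for illustration.
    """
    links = []        # (from_docID, to_docID) pairs, i.e. the links database
    anchor_text = {}  # to_docID -> anchor texts attached to the target page
    for source_url, href, text in anchors:
        # Convert a possibly relative URL into an absolute one.
        target_url = urljoin(source_url, href)
        src = url_to_docid.get(source_url)
        dst = url_to_docid.get(target_url)
        if src is None or dst is None:
            continue  # assumption: unknown URLs are dropped in this sketch
        links.append((src, dst))
        # Anchor text is associated with the page the link points TO.
        anchor_text.setdefault(dst, []).append(text)
    return links, anchor_text
```

The list of docID pairs is the kind of input a PageRank computation needs, while the anchor text dictionary stands in for the anchor hits that are added to the forward index of the target document.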
The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
4.2 Major Data Structures
Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.
4.2.1 BigFiles
BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.
4.2.2 Repository
Figure 2. Repository Data Structure
The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.
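To make the record layout concrete, here is a minimal sketch of a repository in which zlib-compressed pages are stored back to back, each prefixed by docID, lengths, and URL. The exact field widths and byte order (`HEADER`) and the function names are assumptions for illustration; the text above only states which fields precede each document.

```python
import struct
import zlib

# Assumed header layout: 64-bit docID, 32-bit URL length, 32-bit body length.
HEADER = struct.Struct("<QII")


def pack_record(doc_id, url, html):
    """Serialize one page as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def iter_records(blob):
    """Scan the repository sequentially; no other index is needed."""
    offset = 0
    while offset < len(blob):
        doc_id, url_len, body_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html
```

Because every record carries its own lengths, the whole repository can be walked from the first byte to the last, which is what allows the other data structures to be rebuilt from it.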
4.2.3 Document Index
The document index keeps information about each document. It is a fixed width ISAM (Indexed Sequential Access Method) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search.
Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because otherwise we must perform one seek for every link which assuming one disk would take more than a month for our 322 million link dataset.
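The two lookup modes can be sketched as follows. This is an illustrative sketch, not the original code: CRC-32 stands in for the checksum function (the text does not name one), the checksums file is modeled as a sorted in-memory list, and all function names are hypothetical.

```python
import bisect
import zlib


def url_checksum(url):
    # Assumption: CRC-32 is only a stand-in for the unspecified checksum.
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_file(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in url_to_docid.items())


def lookup_docid(table, url):
    """Single lookup: binary search on the sorted checksums."""
    key = url_checksum(url)
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


def batch_lookup(table, urls):
    """Batch lookup: sort the requests, then merge in one sequential pass."""
    wanted = sorted((url_checksum(u), u) for u in urls)
    result = {}
    i = 0
    for key, url in wanted:
        while i < len(table) and table[i][0] < key:
            i += 1
        if i < len(table) and table[i][0] == key:
            result[url] = table[i][1]
    return result
```

The batch version reads the checksums file once from front to back instead of seeking per link. With the roughly 10 ms seek time quoted in Section 4.2, 322 million individual seeks would take about 3.2 million seconds, or roughly 37 days, which matches the "more than a month" estimate.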
4.2.4 Lexicon
The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.
4.2.5 Hit Lists
A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3.
Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.
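The two-byte encoding can be illustrated with a small bit-packing sketch. The field widths follow the description above, but the order of the fields within the 16 bits is an assumption, as are the function names. Positions that do not fit are clamped here to the largest representable value (4095 for plain hits, 255 for fancy hits).

```python
FANCY_FONT = 0b111  # font size 7 flags a fancy hit


def encode_plain(capitalized, font_size, position):
    """Pack a plain hit: 1 capitalization bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size < FANCY_FONT:
        raise ValueError("plain hits use relative font sizes 0..6")
    position = min(position, 0xFFF)  # clamp to what 12 bits can hold
    return (int(capitalized) << 15) | (font_size << 12) | position


def encode_fancy(capitalized, hit_type, position):
    """Pack a fancy hit: 1 cap bit, font=7, 4 type bits, 8 position bits."""
    if not 0 <= hit_type < 16:
        raise ValueError("fancy hit type must fit in 4 bits")
    position = min(position, 0xFF)  # clamp to what 8 bits can hold
    return (int(capitalized) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def decode(hit):
    """Unpack a 16-bit hit into a dictionary of its fields."""
    capitalized = bool(hit >> 15)
    font = (hit >> 12) & 0b111
    if font == FANCY_FONT:
        return {"kind": "fancy", "capitalized": capitalized,
                "type": (hit >> 8) & 0xF, "position": hit & 0xFF}
    return {"kind": "plain", "capitalized": capitalized,
            "font_size": font, "position": hit & 0xFFF}
```

For anchor hits, the 8 position bits of a fancy hit would be split further into 4 bits of position and 4 bits of docID hash; that refinement is left out of the sketch.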
Figure 3. Forward and Reverse Indexes and the Lexicon
The length of a hit list is stored before the hits themselves. To save space, the length of the hit list is combined with the wordID in the forward index and the docID in the inverted index. This limits it to 8 and 5 bits respectively (there are some tricks which allow 8 bits to be borrowed from the wordID). If the length is longer than would fit in that many bits, an escape code is used in those bits, and the next two bytes contain the actual length.
4.2.6 Forward Index
The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordID's. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual wordID's, we store each wordID as a relative difference from the minimum wordID that falls into the barrel the wordID is in. This way, we can use just 24 bits for the wordID's in the unsorted barrels, leaving 8 bits for the hit list length.
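A forward barrel entry of this shape can be sketched as one 32-bit word holding a 24-bit relative wordID and an 8-bit hit list length, followed by the two-byte hits. The escape value (`ESCAPE`, all ones), the byte order, and the function names are assumptions; the text only says that an escape code is used and that the next two bytes hold the real length.

```python
import struct

ESCAPE = 0xFF  # assumed escape code: length is stored in the next two bytes


def pack_entry(word_id, barrel_min_word_id, hits):
    """Pack (relative wordID, hit count) into 32 bits, followed by 16-bit hits."""
    rel = word_id - barrel_min_word_id
    if not 0 <= rel < (1 << 24):
        raise ValueError("relative wordID must fit in 24 bits")
    n = len(hits)
    if n > 0xFFFF:
        raise ValueError("hit list too long for a two-byte length")
    if n < ESCAPE:
        head = struct.pack("<I", (rel << 8) | n)
    else:
        head = struct.pack("<IH", (rel << 8) | ESCAPE, n)
    return head + struct.pack(f"<{n}H", *hits)


def unpack_entry(buf, barrel_min_word_id, offset=0):
    """Return (wordID, hits, offset just past this entry)."""
    (head,) = struct.unpack_from("<I", buf, offset)
    offset += 4
    rel, n = head >> 8, head & 0xFF
    if n == ESCAPE:
        (n,) = struct.unpack_from("<H", buf, offset)
        offset += 2
    hits = list(struct.unpack_from(f"<{n}H", buf, offset))
    offset += 2 * n
    return barrel_min_word_id + rel, hits, offset
```

Storing the wordID relative to the barrel minimum is what frees the low 8 bits for the length, so the common case of a short hit list costs no extra bytes.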
4.2.7 Inverted Index
The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docID's together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
An important issue is in what order the docID's should appear in the doclist. One simple solution is to store them sorted by docID. This allows for quick merging of different doclists for multiple word queries. Another option is to store them sorted by a ranking of the occurrence of the word in each document. This makes answering one word queries trivial and makes it likely that the answers to multiple word queries are near the start. However, merging is much more difficult. Also, this makes development much more difficult in that a change to the ranking function requires a rebuild of the index. We chose a compromise between these options, keeping two sets of inverted barrels -- one set for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of barrels first and if there are not enough matches within those barrels we check the larger ones.
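The reason docID ordering makes merging quick is that two sorted doclists can be intersected in a single linear pass. The sketch below models a doclist as a list of `(docID, hit_list)` pairs sorted by docID; that representation and the function name are assumptions for illustration.

```python
def intersect_doclists(a, b):
    """Linear merge of two doclists sorted by docID.

    Each doclist is a list of (docID, hit_list) pairs. The result keeps the
    documents present in both, with the hit lists from each side.
    """
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        doc_a, hits_a = a[i]
        doc_b, hits_b = b[j]
        if doc_a == doc_b:
            out.append((doc_a, hits_a, hits_b))
            i += 1
            j += 1
        elif doc_a < doc_b:
            i += 1  # advance whichever list is behind
        else:
            j += 1
    return out
```

If the doclists were instead ordered by a per-word ranking, the two cursors could not be advanced in lockstep like this, which is the merging difficulty the paragraph refers to.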
4.3 Crawling the Web
Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.
In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
It turns out that running a crawler which connects to more than half a million servers and generates tens of millions of log entries also generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there need to be significant resources devoted to reading the email and solving these problems as they come up.
4.4 Indexing the Web
- Parsing -- Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very robust involved a fair amount of work.
- Indexing Documents into Barrels -- After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordID's, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel and then the small log file of extra words can be processed by one final indexer.
- Sorting -- In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we parallelize the sorting phase to use as many machines as we have simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory based on wordID and docID. Then the sorter loads each basket into memory, sorts it and writes its contents into the short inverted barrel and the full inverted barrel.
4.5 Searching
The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seem to have made great progress in terms of efficiency. Therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort. The Google query evaluation process is shown in Figure 4.
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.

Figure 4. Google Query Evaluation
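The control flow of the query evaluation steps can be sketched as below. This is a simplified illustration, not the original searcher: barrels are modeled as dictionaries mapping wordID to `{docID: hit_list}`, the set intersection replaces the sequential doclist scan, the `rank` callback is supplied by the caller, and the name `evaluate` and its parameters are hypothetical.

```python
def evaluate(query, lexicon, short_barrels, full_barrels, rank,
             k=10, max_matches=40_000):
    """Two-tier query evaluation: short barrels first, then full barrels."""
    words = query.lower().split()                       # step 1: parse
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]              # step 2: wordIDs
    scored = {}
    for barrels in (short_barrels, full_barrels):       # steps 3 and 6
        doclists = [barrels.get(w, {}) for w in word_ids]
        # Step 4: documents that appear in every word's doclist.
        matching = set(doclists[0]).intersection(*doclists[1:])
        for doc_id in sorted(matching):
            if doc_id in scored:
                continue  # already ranked from the short barrels
            hit_lists = [dl[doc_id] for dl in doclists]
            scored[doc_id] = rank(doc_id, hit_lists)    # step 5: rank
            if len(scored) >= max_matches:
                break
        if len(scored) >= max_matches:
            break  # response-time cap: jump straight to the final sort
    # Step 8: sort matches by rank and return the top k docIDs.
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

One simplification to keep in mind: a document found in the short barrels is ranked from its title and anchor hits only, since the sketch does not revisit it in the full barrels.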
To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.
4.5.1 The Ranking System
Google maintains much more information about web documents than typical search engines. Every hitlist includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case -- a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.
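The single-word scoring described above can be illustrated with a short sketch. Every number in it is made up: the type-weights, the count cap, and in particular the way the IR score is combined with PageRank (a plain weighted sum with a hypothetical `alpha`) are assumptions, since the text does not give the actual values or the combining function.

```python
from collections import Counter

# Hypothetical type-weights, indexed by hit type.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0,
                "large": 2.0, "small": 1.0}
COUNT_CAP = 8  # hypothetical: counts beyond this add nothing


def count_weight(count):
    """Linear at first, then flat, so repeating a word stops helping."""
    return float(min(count, COUNT_CAP))


def ir_score(hit_types):
    """Dot product of the count-weight vector with the type-weight vector."""
    counts = Counter(hit_types)
    return sum(TYPE_WEIGHTS[t] * count_weight(counts[t]) for t in TYPE_WEIGHTS)


def final_rank(hit_types, pagerank, alpha=0.5):
    """Placeholder combination of IR score and PageRank (assumed form)."""
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank
```

The cap in `count_weight` is what keeps any single factor from dominating: a page that repeats a word a hundred times in small text scores no higher than one that uses it eight times.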
For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor) but is classified into 10 different value "bins" ranging from a phrase match to "not even close". Counts are computed not only for every type of hit but for every type and proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can be displayed with the search results using a special debug mode. These displays have been very helpful in developing the ranking system.
4.5.2 Feedback
The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine. A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then when we modify the ranking function, we can see the impact of this change on all previous searches which were ranked. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.
5 Results and Performance
The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "bill clinton". These results demonstrate some of Google's features. The results are clustered by server. This helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine this was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable. It is also a result of anchor text.
All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves at http://google.stanford.edu.
5.1 Storage Requirements
Aside from search quality, Google is designed to scale cost effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and storage requirements of Google. Due to compression the total size of the repository is about 53 GB, just over one third of the total data it stores. At current disk prices this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB. Furthermore, most queries can be answered using just the short inverted index. With better encoding and compression of the Document Index, a high quality web search engine may fit onto a 7GB drive of a new PC.
| Storage Statistics | |
|---|---|
| Total Size of Fetched Pages | 147.8 GB |
| Compressed Repository | 53.5 GB |
| Short Inverted Index | 4.1 GB |
| Full Inverted Index | 37.2 GB |
| Lexicon | 293 MB |
| Temporary Anchor Data (not in total) | 6.6 GB |
| Document Index Incl. Variable Width Data | 9.7 GB |
| Links Database | 3.9 GB |
| Total Without Repository | 55.2 GB |
| Total With Repository | 108.7 GB |

| Web Page Statistics | |
|---|---|
| Number of Web Pages Fetched | 24 million |
| Number of Urls Seen | 76.5 million |
| Number of Email Addresses | 1.7 million |
| Number of 404's | 1.6 million |

Table 1. Statistics
5.2 System Performance
It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, or any number of other problems which stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.
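The throughput figures quoted above can be checked with a few lines of arithmetic. The variable names are arbitrary; the inputs are the numbers given in the paragraph.

```python
# Steady-state phase: the last 11 million pages in 63 hours.
steady_pages = 11_000_000
steady_hours = 63
pages_per_second = steady_pages / (steady_hours * 3600)
pages_per_day = steady_pages / steady_hours * 24

# Whole crawl: 26 million pages (including errors) in roughly 9 days.
overall_pages_per_second = 26_000_000 / (9 * 24 * 3600)

print(f"steady state: {pages_per_second:.1f} pages/s, "
      f"{pages_per_day / 1e6:.2f} million pages/day")
print(f"whole crawl:  {overall_pages_per_second:.1f} pages/s")
```

The steady-state rate works out to about 48.5 pages per second, or just over 4 million pages per day, in line with the text. Averaged over the full 9 days the rate is closer to 33 pages per second, which shows how much time the outages cost.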
5.3 Search Performance
Improving the performance of search was not the major focus of our research up to this point. The current version of Google answers most queries in between 1 and 10 seconds. This time is mostly dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not have any optimizations such as query caching, subindices on common terms, and other common optimizations. We intend to speed up Google considerably through distribution and hardware, software, and algorithmic improvements. Our target is to be able to handle several hundred queries per second. Table 2 has some sample query times from the current version of Google. They are repeated to show the speedups resulting from cached IO.
| Query | Initial Query: CPU Time (s) | Initial Query: Total Time (s) | Same Query Repeated (IO mostly cached): CPU Time (s) | Same Query Repeated (IO mostly cached): Total Time (s) |
|---|---|---|---|---|
| al gore | 0.09 | 2.13 | 0.06 | 0.06 |
| vice president | 1.77 | 3.84 | 1.66 | 1.80 |
| hard disks | 0.25 | 4.86 | 0.20 | 0.24 |
| search engines | 1.31 | 9.63 | 1.16 | 1.16 |

Table 2. Search Times
6 Conclusions
Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality including PageRank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.
6.1 Future Work
A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates. We must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled. Work toward this goal has been done in [Cho 98]. One promising area of research is using proxy caches to build search databases, since they are demand driven. We are planning to add simple features supported by commercial search engines like boolean operators, negation, and stemming. However, other features are just starting to be explored such as relevance feedback and clustering (Google currently supports a simple hostname based clustering). We also plan to support user context (like the user's location), and result summarization. We are also working to extend the use of link structure and link text. Simple experiments indicate PageRank can be personalized by increasing the weight of a user's home page or bookmarks. As for link text, we are experimenting with using text surrounding links in addition to the link text itself. A Web search engine is a very rich environment for research ideas. We have far too many to list here so we do not expect this Future Work section to become much shorter in the near future.
6.2 High Quality Search
The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. For example, the top result for a search for "Bill Clinton" on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide higher quality search so as the Web continues to grow rapidly, information can be found easily. In order to accomplish this Google makes heavy use of hypertextual information consisting of link structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a search engine is difficult, we have subjectively found that Google returns higher quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.
6.3 Scalable Architecture
Aside from the quality of search, Google is designed to scale. It must be efficient in both space and time, and constant factors are very important when dealing with the entire Web. In implementing Google, we have seen bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO. Google has evolved to overcome a number of these bottlenecks during various operations. Google's major data structures make efficient use of available storage space. Furthermore, the crawling, indexing, and sorting operations are efficient enough to be able to build an index of a substantial portion of the web -- 24 million pages, in less than one week. We expect to be able to build an index of 100 million pages in less than a month.
6.4 A Research Tool
In addition to being a high quality search engine, Google is a research tool. The data Google has collected has already resulted in many other papers submitted to conferences and many more on the way. Recent research such as [Abiteboul 97] has shown a number of limitations to queries about the Web that may be answered without having the Web available locally. This means that Google (or a similar system) is not only a valuable research tool but a necessary one for a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.
7 Acknowledgments
Scott Hassan and Alan Steremberg have been critical to the development of Google. Their talented contributions are irreplaceable, and the authors owe them much gratitude. We would also like to thank Hector Garcia-Molina, Rajeev Motwani, Jeff Ullman, and Terry Winograd and the whole WebBase group for their support and insightful discussions. Finally we would like to recognize the generous support of our equipment donors IBM, Intel, and Sun and our funders. The research described here was conducted as part of the Stanford Integrated Digital Library Project, supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by DARPA and NASA, and by Interval Research, and the industrial partners of the Stanford Digital Libraries Project.
References
- [Abiteboul 97] Serge Abiteboul and Victor Vianu, Queries and Computation on the Web. Proceedings of the International Conference on Database Theory. Delphi, Greece 1997.
- [Bagdikian 97] Ben H. Bagdikian. The Media Monopoly. 5th Edition. Publisher: Beacon, ISBN: 0807061557
- [Chakrabarti 98] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
- [Cho 98] Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
- [Gravano 94] Luis Gravano, Hector Garcia-Molina, and A. Tomasic. The Effectiveness of GlOSS for the Text-Database Discovery Problem. Proc. of the 1994 ACM SIGMOD International Conference On Management Of Data, 1994.
- [Kleinberg 98] Jon Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc. ACM-SIAM Symposium on Discrete Algorithms, 1998.
- [Marchiori 97] Massimo Marchiori. The Quest for Correct Information on the Web: Hyper Search Engines. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
- [McBryan 94] Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. First International Conference on the World Wide Web. CERN, Geneva (Switzerland), May 25-27, 1994. http://www.cs.colorado.edu/home/mcbryan/mypapers/www94.ps
- [Page 98] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript in progress. http://google.stanford.edu/~backrub/pageranksub.ps
- [Pinkerton 94] Brian Pinkerton, Finding What People Want: Experiences with the WebCrawler. The Second International WWW Conference Chicago, USA, October 17-20, 1994. http://info.webcrawler.com/bp/WWW94.html
- [Spertus 97] Ellen Spertus. ParaSite: Mining Structural Information on the Web. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
- [TREC 96] Proceedings of the fifth Text REtrieval Conference (TREC-5). Gaithersburg, Maryland, November 20-22, 1996. Publisher: Department of Commerce, National Institute of Standards and Technology. Editors: D. K. Harman and E. M. Voorhees. Full text at: http://trec.nist.gov/
- [Witten 94] Ian H Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
- [Weiss 96] Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Manprempre, Peter Szilagyi, Andrzej Duda, and David K. Gifford. HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. Proceedings of the 7th ACM Conference on Hypertext. New York, 1996.
Vitae

Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. Currently, he is a Ph.D. candidate in computer science at Stanford University where he received his M.S. in 1995. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.
Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan Ann Arbor in 1995. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human computer interaction, search engines, scalability of information access interfaces, and personal data mining.
8 Appendix A: Advertising and Mixed Motives
Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is "The Effect of Cellular Phone Use Upon Driver Attention", a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 97], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.
Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries [Marchiori 97]. This type of bias is much more insidious than advertising, because it is not clear who "deserves" to be there, and who is willing to pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a viable search engine. But less blatant biases are likely to be tolerated by the market. For example, a search engine could add a small factor to search results from "friendly" companies, and subtract a factor from results from competitors. This type of bias is very difficult to detect but could still have a significant effect on the market. Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline's homepage when the airline's name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and would possibly have cost the search engine the revenue from the airline. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.
9 Appendix B: Scalability
9.1 Scalability of Google
We have designed Google to be scalable in the near term to a goal of 100 million web pages. We have just received disk and machines to handle roughly that amount. All of the time consuming parts of the system are parallelizable and roughly linear time. These include things like the crawlers, indexers, and sorters. We also think that most of the data structures will deal gracefully with the expansion. However, at 100 million web pages we will be very close up against all sorts of operating system limits in the common operating systems (currently we run on both Solaris and Linux). These include things like addressable memory, number of open file descriptors, network sockets and bandwidth, and many others. We believe expanding to a lot more than 100 million pages would greatly increase the complexity of our system.
9.2 Scalability of Centralized Indexing Architectures
As the capabilities of computers increase, it becomes possible to index a very large amount of text for a reasonable cost. Of course, other more bandwidth intensive media such as video are likely to become more pervasive. But, because the cost of production of text is low compared to media like video, text is likely to remain very pervasive. Also, it is likely that soon we will have speech recognition that does a reasonable job converting speech into text, expanding the amount of text available. All of this provides amazing possibilities for centralized indexing. Here is an illustrative example. We assume we want to index everything everyone in the US has written for a year. We assume that there are 250 million people in the US and they write an average of 10k per day. That works out to be about 850 terabytes. Also assume that indexing a terabyte can be done now for a reasonable cost. We also assume that the indexing methods used over the text are linear, or nearly linear in their complexity. Given all these assumptions we can compute how long it would take before we could index our 850 terabytes for a reasonable cost assuming certain growth factors. Moore's Law was defined in 1965 as a doubling every 18 months in processor power. It has held remarkably true, not just for processors, but for other important system parameters such as disk as well. If we assume that Moore's law holds for the future, we need only 10 more doublings, or 15 years to reach our goal of indexing everything everyone in the US has written for a year for a price that a small company could afford. Of course, hardware experts are somewhat concerned Moore's Law may not continue to hold for the next 15 years, but there are certainly a lot of interesting centralized applications even if we only get part of the way to our hypothetical example.
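The arithmetic in this example can be checked with a few lines. This is only a sketch of the estimate: it assumes "10k" means 10 × 1024 bytes and that a terabyte is 2^40 bytes, which is the reading under which the figure comes out to roughly 850; the variable names are illustrative.

```python
import math

PEOPLE = 250_000_000                  # assumed US population from the text
BYTES_PER_PERSON_PER_DAY = 10 * 1024  # "10k" read as 10 KiB (assumption)
DAYS_PER_YEAR = 365
TERABYTE = 2 ** 40                    # binary terabyte (assumption)
MONTHS_PER_DOUBLING = 18              # doubling period stated in the text

# One year of text written by everyone in the US.
total_bytes = PEOPLE * BYTES_PER_PERSON_PER_DAY * DAYS_PER_YEAR
total_terabytes = total_bytes / TERABYTE

# Starting from "one terabyte is affordable today", count the capacity
# doublings needed before the whole corpus is affordable.
doublings = math.ceil(math.log2(total_terabytes))
years = doublings * MONTHS_PER_DOUBLING / 12

print(f"{total_terabytes:.0f} TB, {doublings} doublings, {years:.0f} years")
```

Ten doublings is a factor of 1024, which just covers the roughly 850-fold gap between one terabyte and the full corpus.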
Of course, distributed systems like GlOSS [Gravano 94] or Harvest will often be the most efficient and elegant technical solution for indexing, but it seems difficult to convince the world to use these systems because of the high administration costs of setting up large numbers of installations. Of course, it is quite likely that reducing the administration cost drastically is possible. If that happens, and everyone starts running a distributed indexing system, searching would certainly improve drastically.
Because humans can only type or speak a finite amount, and as computers continue improving, text indexing will scale even better than it does now. Of course there could be an infinite amount of machine generated content, but just indexing huge amounts of human generated content seems tremendously useful. So we are optimistic that our centralized web search engine architecture will improve in its ability to cover the pertinent text information over time and that there is a bright future for search.
From author David A. Utter of WebProNews
http://www.webpronews.com/topnews/topnews/wpn-60-20060721HowGoogleCanFindYourSecretPage.html
Amazingly enough, some webmasters haven't learned about Google yet, and how easy it is to retrieve pages that have been poorly protected from being viewed.
When the blogger behind the brand new EvolvedLight blog wanted to find out more information regarding an accident at Alton Towers amusement park in Staffordshire, England, the quest for information led to the park's media page.
"This site is for Media use only. To gain an access password please call 01538 704015," reads the page. Instead, the blogger turned to the ubiquitous Google to indulge in a little Google hacking.
In looking at the source code, one section revealed that whatever is entered as a password would trigger a redirect to a page named {password}.html. The right password would reveal the press page.
So the blogger sent Google a simple search string: * site:http://press.altontowers.com and guess what was revealed as the third result in the SERPs?
"Welcome to the Alton Towers Press Site," said the revealed page, called pressxpsa.html. That means the password would be pressxpsa.
And indeed it is. To call this a poorly designed page would be an insult to poorly designed pages everywhere.
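To see why this scheme protects nothing, here is a small sketch of the logic described above. It is not the park's actual code: the function names are made up, and it assumes the protected page sits at the root of the press site.

```python
import posixpath
from urllib.parse import urlparse


def redirect_target(base_url, password):
    """Mimic the flawed 'check': the typed password simply becomes the page name."""
    return f"{base_url.rstrip('/')}/{password}.html"


def password_from_url(url):
    """Recover the 'password' from any URL of the protected page that leaks out."""
    filename = posixpath.basename(urlparse(url).path)
    name, _extension = posixpath.splitext(filename)
    return name


# Once a search engine indexes the page, its URL is public...
indexed_url = redirect_target("http://press.altontowers.com", "pressxpsa")
# ...and the URL is the password.
recovered = password_from_url(indexed_url)
```

The password never gets verified anywhere; knowing the file name is the whole secret, so any crawler, referrer log or browser history that sees the URL gives it away. A real check has to happen on the server before the page is served.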
In the interest of helping out someone in need, here is a Microsoft link on securing ASP pages for the amusement park's Windows Server 2003 host running IIS 6.
From Microsoft.com: "Microsoft SQL Server 2005 Everywhere Edition offers essential relational database functionality in a compact footprint ideal for embedding in mobile and desktop applications including a new generation of occasionally connected dynamic applications." I'm not sure I agree.
Here is my initial list of pros/cons from a developer's point of view:
Pros
Completely run in-proc, meaning there is no installation required
Free to develop, deploy and redistribute
Allows up to 4GB databases
Support for up to 256 connections
Very compact (max of 7 DLLs required at 1.4 MB)
Cons
No user interface available (I had to create my database through code)...I'm sure a 3rd party will develop one very soon.
Support for only a limited number of datatypes - for example, supports nvarchar, but not varchar or the XML data type!
Does not support stored procedures, views, triggers, extended stored procedures, or macros.
Sample Application
The code is not that complicated, so I'm not going to write up a technical explanation. But in my overall opinion, this product seems to be a much more powerful version of Access without the interface. I don't see myself using it anytime soon.
I spent a few hours reading about SQL Server Everywhere and put together this sample application.
Download Sample Application
Form1.cs
Useful links
Download SQL Server Everywhere: http://www.microsoft.com/sql/ctp_sqleverywhere.mspx
Paul Flessner announced SQL Server Everywhere: http://www.microsoft.com/sql/letter.mspx
Steve Lasker's Blog - Interview with Paul Flessner: http://blogs.msdn.com/stevelasker/archive/2006/04/10/SqlEverywhereInfo.aspx
Over the past 10 years, I have spent a good amount of time teaching software development. First, in college, I was a teaching assistant at Northern Illinois University. Then after graduating, I worked for a couple of companies and did some internal training. Over the years, I have helped a few friends get into the IT industry…taught them the basics and helped get their foot in the door of a new career. I now provide training for companies on a consultant basis.
I have learned a lot from people, books, magazines, conferences and online…and continue to learn more every day. At least once a week, I get emails from people who are getting started in software development. Some people have questions while others just thank me for the articles that I publish. Beginning developers often ask something like, “What do you recommend I do so I can become a full time web developer?” More experienced developers often ask what I would recommend to help advance their careers or become an independent contractor.
These answers are not easy…it depends on your experience, your available time, and most importantly, your drive. But I took the time to put together what I think is important. The first 5 are for those of you just getting started and the last 5 are for the life of your career.
- Find Good Resources – Whether it’s a friend, colleague, website forum or some online community, try to find at least one good resource. When I graduated college, most of my skills were mainframe-based (Cobol, JCL, etc.). I immediately got a job with IBM but I really wanted to do web development. So I bought a few ASP and Visual Basic 6.0 books, read through them and did the examples. It all made sense, but I wanted to get some practical experience. I was consulting at Allstate Insurance at the time (mainframe area) and had a great deal of knowledge of their systems. Allstate had made the decision to get on the web at that time (May 2000) and I was able to get a position as a business knowledge expert. As the project progressed, I became friends with several of the developers and had great people to bounce questions off of. Over the next 6 months, I learned more than ever…they told me what to learn now, what book chapters to skip for now, what was important, what was not important. Everything became very clear, I developed a couple of ecommerce sites (including Cigars Around the World’s first true database-driven ecommerce site) and earned my MCSD in VB 6.0. This may be the most important recommendation for getting started…but be careful, there are a ton of hacks out there and you do not want their advice!
- Learn the Basics – 75% of the code you write can be written in any language. Your most common tasks will include declaring variables, assigning values, performing calculations, looping through data, if…then…else, functions, etc. Every language supports these features and they must be mastered. Any “Introduction” book will cover these features in very good detail. If a book has at least a dozen or so reviews, see what the consensus says. Better yet, a good resource should be able to recommend a good starter book. I also highly recommend buying an introduction book for the language you plan on developing in. You might not know what that is, but you should have a good idea. You want to always be reading in the language you are learning…this may seem obvious, but I know people who want to learn C#, but are reading a C++ book because their brother had one lying around. If you want to learn C#, make the investment of buying a good C# book. And don’t worry too much about choosing a language to start with…as I said above, the basics translate to any language…it’s just syntax differences.
- Keep It Simple – I have a friend who says this all the time in meetings where we’re discussing how we want to solve a particular issue. That’s all he says…really. Most people kind of laugh at it. I used to, but then I thought about it a little and now I consciously practice it. If you put 1,000 developers in a room and gave them even a simple coding task, not one person would code it the same. The question is, would everyone be able to read your code…and if you looked at it a few months later, would you be able to quickly read and understand it? There are a million ways to solve any problem; the best approach is to keep it simple. It makes it much easier to maintain and also decreases the chance of someone else breaking it (including yourself!).
- Learn By Doing – Reading and comprehending a subject isn’t that difficult…but can you really apply what you just learned? I have tried to get through books very fast and skip over some of the “Chapter Tasks” at the end. The next day, I couldn’t tell you much about what I had read. That’s why I make a point of immediately performing all of the exercises in a book. When I’m teaching a friend programming (as I am right now), we typically work side by side on separate computers. We both do everything I’m teaching and he’s learning. I could just sit there and type everything as he watches, but it has been my experience that it never works. The more time you spend in the software development environment (i.e. Visual Studio.NET, SQL Server Management Studio, etc.), the more comfortable you will get.
- Plan your Code, Code your Plan – One of the biggest mistakes developers make is to start coding right away. Big mistake! I cannot stress that enough! People get all excited about creating an interface, designing the logo, blah, blah, blah. Yeah, that’s fun, but it’s not the correct approach to take. There are entire courses on proper systems design and architecture (which are definitely worthwhile!), but the basic approach to a System Development Life Cycle (SDLC) is this:
  - Project planning and feasibility study: Establish a high-level view of the project and determine its goals.
  - Systems analysis and requirements definition: Refine goals into defined functions. Analyze end-user needs.
  - Systems design: Describe desired features in detail, including screen layouts, business rules, process diagrams and other documentation.
  - Implementation: Write the code. Personally, I design the database first,
  - Testing: Check for errors, bugs and interoperability.
  - Deployment: The final stage of the initial development where the software is put into production and runs the actual business.
  - Maintenance: The rest of the software's life: changes, corrections, additions, moves to different platforms, etc. This, the least glamorous and perhaps most important step of all, goes on seemingly forever.
It is so important that you follow these or some other SDLC approach. The amount of upfront planning and preparation will save you tons of time in the long run.
- Don’t be lazy! – Often, you might be under a lot of pressure to get something done very quickly…it is very important to never compromise the quality of the software to get something done fast. Even worse than that are people who take the shortcut or “band-aid” code just to be done with it. These are the worst developers and are not respected. If there is any chance a user might do something on your website, be sure to account for it…because they will do it! I can think of at least a half dozen cases where a fellow developer did not code for some scenario because “there’s no way anyone would ever do that”. And more times than not, some user somewhere did it…and I’ve seen entire websites crash because of it (and people do get fired). The developer always blames the user for being stupid and doing what should not have been done, when in fact the developer is the real idiot. Don’t be lazy…don’t be a hack…do it right.
- Put First Things First – One of the best books I have ever read (and read at least once every year) is Stephen Covey’s “The 7 Habits of Highly Effective People”. He says in Habit 3: Put First Things First, “The key is not to prioritize your schedule, but to schedule your priorities. Do the most important things first – because where you are headed is more important than how fast you are going”. He also says of all of the 7 habits, this is the hardest one to master. I completely agree…it is so easy to work on the fun tasks or prioritize what may seem urgent over working on the not-so-fun things that require time and serious thought. But what’s important should always take precedence over what’s considered urgent. Self discipline can be difficult and you have to realize these tasks you’re pushing aside for another time will never go away. You have to do them and you have to make them a priority. In Covey’s book, he also referenced a lifelong study on what the common trait among successful people is. The answer: successful people know to “Put First Things First”. This may be the most important recommendation for your continued career.
- Reuse, Reuse, Reuse – You should never need to write the same logic twice. Does your application send emails out? If so, you better have one “SendEmail” method that everyone uses. Do you query the database for the current specials to be displayed on every page? You better have that method encapsulated in a database tier and every page better be getting the data through that method (Better yet, you better be caching that data to eliminate the extra database queries!). Whenever I start coding a new website (after the database has been designed), the first thing I do is add in my “Common” code. This common code takes care of all of the tasks common to every website, which includes validation, javascript and form helpers, constants, enumerations, exception handling, emailing, base classes, and much more. At least 10-15% of my code is now done and I know it works perfectly. The next thing I do is generate my data access tier. I always write my data access code the same from project to project: I create custom classes to represent data entities, I pass all data from tier to tier through these custom classes and I use the Microsoft Data Access Blocks to do the database access. It’s really time consuming to create the custom classes, CRUD methods and the stored procedures, so I created my own data tier generator. With the data tier generator, I click one button and everything is generated in a couple of seconds…hours and hours of work now done instantaneously and I know the code is perfect. Work smart, not hard.
- Certifications – Get certified or not? This has been a debate among fellow developers for a long time. My personal belief is to get certified, but it’s not priority #1. My reasons include:
  - I have interviewed for contracts where they only accepted resumes from people who were at least an MCSD (Microsoft Certified Solution Developer).
  - I have taken several tests and they are not easy…they do require a very good deal of knowledge and understanding in order to pass.
  - Some companies give bonuses or raises for obtaining your certification.
  - Instant credibility – Only those who do not believe in the program will not care…the other 99%, whether or not they know what the certification is, have an instant feeling that you know what you’re doing.
Most people I speak to agree getting certified is a good idea. Some are indifferent and a few completely disagree. The main disagreement is that anyone can get certified. They’re just tests and there are “brain dump” websites out there where people post questions and answers right after taking the tests. These are true statements, but almost every interviewer I have met for a contract or short term assignment has looked very positively on my certification. I certainly believe getting certified has made my career as an independent contractor a lot easier…and it obviously cannot hurt it.
- Continued Education – This is what separates the real developers from the 9-5’ers. Most programmers get by with the basics. They know enough to get the job done, work their 8 hours every day and go home to do nothing. Those with drive and ambition to be great continue their education every day…even at home. Whether it’s reading technical magazines, online articles, blogs, or picking up a good book, anything you do to stay on top of the latest technology will give you a huge advantage. Not only will you continue to work with the latest software, but you will also be much more sought after, have much better job security and be able to bill a much higher rate. I personally subscribe to 4 technical magazines and a few weekly newsletters. I also try to read one good book per month. By doing this, I am always reading about the newest software and have learned great approaches to difficult problems (which saves me a ton of time and work). My favorite site is The Code Project. They send out a weekly newsletter that lists the recent articles by category. They enforce an excellent format that is easy to follow and provide an area for feedback. When searching for a way to do something specific, I often go here first and almost always find it. Whatever approach you take, make a conscious effort to keep learning. By being smarter and better than the rest, you have complete control of your career…which leads to more opportunities…which leads to a very enjoyable life.
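To make the "Reuse, Reuse, Reuse" point about the current specials a little more concrete, here is a rough sketch of one shared, cached access method. It is written in Python purely for illustration (my own sites use .NET), and every name in it, including the fake loader and its data, is hypothetical.

```python
import time


class CachedQuery:
    """Wrap an expensive loader so every caller shares one cached result."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been loaded yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Cache is empty or stale: hit the database once and remember it.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value


calls = {"count": 0}


def load_specials_from_database():
    """Stand-in for the real data tier call; counts how often it runs."""
    calls["count"] += 1
    return ["Special A", "Special B"]


# The one place every page goes through to get the specials.
current_specials = CachedQuery(load_specials_from_database, ttl_seconds=300)

# Three "page loads" in a row trigger only one database query.
for _ in range(3):
    specials = current_specials.get()
```

The design choice is simply that pages never call the loader directly, so caching, logging or a schema change only has to be handled in one spot.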
While many IT shops deal with the issues related to hundreds or thousands of users, Google regularly manages, administers and upgrades software and systems for millions of users. How do they do it? I found this article with an excellent overview of the Google file system and its unique capabilities.
http://storagemojo.com/?page_id=152
I've seen a lot of people have problems getting their image, css and js paths set up correctly in ASP.NET 2.0. In 2.0, every site runs on its own port just off the localhost, but the "virtual directory" name is still in the path...so you can't force everything down to the root. If you do any type of URL Rewriting or use Master Pages in pages in different level subfolders, this is going to be an issue. Prefixing paths with '~/' works sometimes, but not in all cases.
In 1.1, I would often set the webinfo file contents to point to http://localhost. Why not do the same in 2.0? To do so, open your web project Property Pages, click into the 'Start Options' area, select 'Use custom server' and enter in 'http://localhost'. This will set your environment up as if it were running on your server as a website. Only pain is changing the IIS default directory when working on another site...but that only takes a couple of seconds.
For the past 2 weeks, I have been giving out free licenses for the Google Sitemap Generator. Each day, I'd send out between 30 and 40 licenses. Yesterday, I released version 2.1 and stated "Last day to get a free license is 6/14/2006." Well, that had some effect.
Today's stats:
Unique visitors: 624 (normally 230-250)
Free licenses: 141 (normally 30-40)
And I am still allowing up to 11:59 PM CST.
I guess when you set a deadline for something free, people finally get off their butts and ask. And I'm sure some of these people told their friends.
As of tomorrow, it'll cost $49.99. So for all of you who want to thank me for the great software and free license, tell your friends about it and tell them to buy a copy. Friendly message: I track all usage of the application, so don't try to give your friends your license or you'll lose it!
Has anyone seen this yet?
http://local.live.com/
Google Maps has some competition.
Last week, a friend told me about this new product available for free from Red Gate Software, SQL Prompt. Since then, I've seen several blogs talking about it.
What is it?
Basically, it's Intellisense for SQL Server. So when you start typing "SELECT * FROM", it automatically pops up a window with the available tables, fields, functions, etc. It also reacts just like Visual Studio (Ctrl-Space opens the window, Tab places the selected entry, etc.) and has a ton of options (just open the Options window from the System Tray).
I thought it would only work in Query Analyzer, but nope, it works in Enterprise Manager, Management Studio, Visual Studio and even some 3rd party apps. I downloaded it myself and installed it. I've tested it with SQL Server 2000 and 2005 and it's great. Check it out and get it while it's free (until Sept 1st)!
Website: Red Gate Software, SQL Prompt
I was doing some research on Web Services and came across this webpage. It's Chapter 13 from Thomas Erl's book on Service-Oriented Architecture, and it's about integrating web services.
It's great for people who know little to nothing about Web Services and want to learn. It will even teach something to those of us who already know a lot about Web Services.
Check it out:
PerfectXML website.
I don't use XPath enough to memorize all of the possible syntax formats. If you do, get a life! Anyways...the other day I needed to find an example of a complex XPath expression and came across this page on MSDN. It lists the commonly used XPath expressions and is invaluable. Enjoy!
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/xmlsdk/html/1431789e-c545-4765-8c09-3057e07d3041.asp
I've been working in ASP.NET 2.0 for a month or so now, but just pushed my first application to my production servers. Website:
http://www.midwestscubadiving.com
The homepage loaded up just fine, but then I clicked on "Login" and got the following message:
Cannot convert type 'ASP.login_aspx' to 'System.Web.UI.WebControls.Login'
Looking at the message, I realized the word "Login" appears to be some type of reserved type. New in ASP.NET 2.0 are the built in "Login" controls. I'm guessing there must be some conflict. I didn't want to rename my page, so I searched around a little. Someone suggested unchecking "Allow this precompiled site to be updatable". I tried that and it worked...and I don't mind not allowing the site to be updatable. This will prevent me from making small tweaks on the Production server without making them locally!
I came across the same issue with the "Profile.aspx" page. I decided to rename this page as the above resolution didn't work...and I'll probably end up renaming "Login.aspx".
Has anyone else come across this and found a better resolution?
By default, Web Parts uses the SQL 2005 Express provider for data storage.
To use SQL Server 2000 or SQL Server 2005 (Non-Express), you need to do the following:
1. Use the aspnet_regsql.exe tool located in C:\WINDOWS\Microsoft.NET\Framework\v2.0.50215\ to prepare SQL Server. Basically, it will add the 'aspnetdb' database to your SQL instance.
2. In your Web.config, add the following:
<connectionStrings>
    <add name="ConnectionString"
         connectionString="Data Source=your_server_name;Initial Catalog=aspnetdb;Integrated Security=True"
         providerName="System.Data.SqlClient" />
</connectionStrings>
<system.web>
    .
    .
    <webParts>
        <personalization defaultProvider="SqlPersonalizationProvider">
            <providers>
                <add name="SqlPersonalizationProvider"
                     type="System.Web.UI.WebControls.WebParts.SqlPersonalizationProvider"
                     connectionStringName="ConnectionString" applicationName="/" />
            </providers>
            <authorization>
                <deny users="*" verbs="enterSharedScope" />
                <allow users="*" verbs="modifyState" />
            </authorization>
        </personalization>
    </webParts>
</system.web>
That's it...just create a new Web Application and start adding in Web Parts.
Lately, I've been spending a lot of time in Visual Studio 2005. A cool new feature is code snippets. They are reusable, task-oriented blocks of code available in the code editor. They are available to download online, you can search for new ones through Visual Studio 2005 or you can create your own. Just yesterday, I came across this link:
http://msdn.microsoft.com/vstudio/downloads/codesnippets/default.aspx
Microsoft published dozens of code snippets that install into VS 2005. Check them out.
Yesterday, a friend told me about TopDesk by Otaku Software. I downloaded the trial version, installed it and messed around with it a little. If you haven't seen this yet, it's pretty cool. What is it? According to their website: "Find windows, fast. TopDesk is a quick and easy way to switch between applications. With a single key press, you can instantly view thumbnails of all open windows, display thumbnails of windows belonging to the current application, or hide all windows to quickly access the desktop." Basically, it's a replacement for ALT-TAB in Windows.
Immediately, I started thinking of how to code something like this. At first thought, I believe I'd need:
1. An application running in the background to:
- Capture windows and cache them for quick display
- Intercept the ALT-TAB keystroke combination
- Display the open windows (tiled, cascaded, spatial)
- Allow tabbing across windows or use of mouse
2. A SysTray icon that opens a setup window to:
- Make changing options easy
- Make enabling/disabling the application easy
3. A registry setting to load the application upon Windows start up.
So today, I'm starting to work on recreating this in C#. Stay tuned for periodic updates and email me if you have any suggestions or questions.
Over the past few years, I have written many applications that are stand-alone executables (Win Forms) that manage all of the business's information locally and then sync up with a web server (in real time or periodically). These applications (inventory, eBay-related, apartment finders, photography studios, etc.) sync information through secure web services but often need to FTP files (PDFs, images, etc.). To do so, I use a C# FTP class I wrote a while ago and have modified over the years.
Recently, I changed hosting providers. I had a dedicated server with Verio (San Jose Data Center) for about 5 years and just moved over to Interland (Georgia Data Center) about 6 months ago. My old servers had Windows 2000 Server and I was running ISS Black Ice firewall. It worked great and I had planned to use it on my new servers...unfortunately, Black Ice doesn't support Windows 2003 Server (at least not officially). Well that's not something I can hope works.
So what firewall do I go with? I spoke with some colleagues who run data centers and of course, they wanted me to go with a hardware solution (i.e. Cisco PIX). Interland wants $200 setup + $200 per month...a little pricey. As for software solutions, Microsoft's ISA (Internet Security and Acceleration) Server 2004 led the reviews, but also costs $1,500. According to my colleagues, ISA should run on its own box...running it alongside SQL Server, IIS 6.0, Email, etc. might be a bit too much for the servers. Another option is the firewall built into Windows Server 2003. It's a very basic firewall that either opens a port or does not open a port. In that sense, it's great if you have very basic rules...but it's meant for single servers only.
Since this was my case, I decided to go with the Windows Server 2003 built-in firewall. Almost immediately, I found out my applications could not FTP images to the server...they could log in, change directories and create directories, but could not "put" files on the server. So quickly I learned about Active vs Passive FTP.
Basically, in Active FTP the client connects to the FTP server's command port (21), and the server then opens the data connection from its data port (20) back to the client. Well that's what I thought I was doing, but with both ports open it still didn't work. Then I reviewed my code and realized I was issuing the "PASV" command, which puts the session into Passive mode. Passive FTP also connects to the FTP server's command port (21), but the client then opens the data connection to a randomly negotiated port (> 1024) on the FTP server. The problem here is that all of these ports are closed by the firewall so it can't work.
So how do I solve this problem? a) I could add code to use Active FTP, but I don't have the time...plus it appears to be very difficult as very few people have done it. b) Purchase a third-party component (Rebex seems to be best) for about $250...eh. c) Figure out a way to get the data through the firewall with the current software.
Obviously, choice "c" is preferable, so I started to look around. After searching all over, I finally found a post where someone figured it out...here's what you do:
1. Open the Control Panel and activate the Windows Firewall control.
2. Click on the Advanced tab, select the network interface that the FTP is bound to and make sure that this option is checked to enable the firewall for this interface.
3. Click on the "Settings..." button.
4. Click on the Services tab and CLEAR the check box for the "FTP Server" option. I know this makes no sense, but neither does the conflict between server and firewall!
5. Click on the "OK" button to close the Advanced Settings dialog.
6. Click on the Exceptions tab, then click on the "Add Program" button.
7. Browse to "C:\WINDOWS\SYSTEM32\INETSRV\inetinfo.exe" and double click on this file to select it.
8. You may want to use the "Change Scope..." button to narrow the range of IP addresses that can contact the FTP server.
9. Click on the "OK" button to close the Add a Program dialog.
10. Click on the "OK" button to close the Windows Firewall control.
11. You may need to reboot for the changes to take effect since "inetinfo.exe" runs as a service. I just restarted IIS.
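A side note for anyone writing new code rather than patching an older hand-rolled FTP class: .NET 2.0 ships System.Net.FtpWebRequest, which exposes the Active/Passive choice as its UsePassive property. Here's a minimal sketch that only builds the request and sends nothing. The helper name CreateUploadRequest and the URL and credentials you pass in are placeholders of mine, not part of my FTP class.

```csharp
using System;
using System.Net;

public static class FtpSketch
{
    // Hypothetical helper: builds (but does not send) an upload request.
    // usePassive = true  -> PASV: the client opens the data connection
    //                       to a server-chosen port (> 1024).
    // usePassive = false -> PORT: the server connects back to the client
    //                       from its data port (20).
    public static FtpWebRequest CreateUploadRequest(
        string url, string user, string password, bool usePassive)
    {
        FtpWebRequest request = (FtpWebRequest)WebRequest.Create(url);
        request.Method = WebRequestMethods.Ftp.UploadFile; // the STOR command
        request.Credentials = new NetworkCredential(user, password);
        request.UsePassive = usePassive;
        request.UseBinary = true; // images and PDFs must not be sent as text
        return request;
    }
}
```

Keep in mind that Active mode only moves the problem: the server then has to connect back to the client, which a firewall or NAT on the client side can block just as easily.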

When communicating over the web (web services or .NET Remoting), you must remember that the most expensive part of the operation comes when you transfer objects between distant machines. The more granular the API is, the higher percentage of time your application spends waiting for data to return from the server. Therefore, be sure to create web-based interfaces based on serializing documents or sets of objects between client and server. This way, the server receives all of the information it needs to complete the requested task.
Here's an example of how NOT to code a web-based interface:
//Create CartItem object on the server
CartItem objCart = new Server.CartItem();
//Round trip to set the ProductID
objCart.ProductID = 135;
//Round trip to set the Quantity
objCart.Quantity = 1;
//Round trip to add the item
objCart.AddItem();
Here's an example of a well designed web-based interface:
//Create CartItem object on the client
CartItem objCart = new CartItem();
//Set local copy
objCart.ProductID = 135;
//Set local copy
objCart.Quantity = 1;
//One round trip to add the item
Server.AddItem(objCart);
The above example is a simple one. To really make this more efficient, you need to apply it to real world scenarios and further examine what's being transmitted back and forth. For example, let's pretend you're writing a software system for an order intake company that has a few million customers who each place 15-20 orders per year. Your staff consists of 20 order operators plus a dozen or so people running reports or simply querying the database.
When a customer calls, you might want to retrieve their orders so you create the following method:
public OrderData FindOrders(string strCustomerName)
{
    //Return all orders for a customer searched by name
}
That's OK, but why return all orders? How about just open orders? So we change it to:
public OrderData FindOpenOrders(string strCustomerName)
{
    //Return all open orders for a customer searched by name
}
Better, but we're still requiring 2 data transmissions per phone call (one get, one save). Let's assume we could partition the call center into regions/states. At the beginning of the operator's shift he/she could retrieve all customers (with open orders) and open orders for the given region. As calls come in, the operator would never need to retrieve any information from the server. Once the call is completed, the operator could push the updates to the server and at the same time, retrieve any updates made since the last update (only one 2-way transmission).
But it can get even better! What if you only retrieved customers who have made purchases in the past 6 months? Anyone else probably won't be coming back and if they do, we could always make a quick trip to get their data (well worth the saving on the initial data retrieval!).
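To make the "load once per shift, sync once per call" idea concrete, here's a rough sketch. Every name in it (Order, IOrderService, OperatorSession, FakeOrderService) is made up for illustration, and the fake service simply counts round trips in memory instead of calling a real web service.

```csharp
using System;
using System.Collections.Generic;

public class Order
{
    public int OrderId;
    public string CustomerName;
    public bool IsOpen;
}

// Hypothetical chunky contract: each call carries a whole set of objects.
public interface IOrderService
{
    List<Order> GetOpenOrders(string region);
    // Pushes local changes and returns changes made elsewhere since lastSync.
    List<Order> Sync(string region, List<Order> localChanges, DateTime lastSync);
}

public class OperatorSession
{
    private readonly IOrderService service;
    private readonly string region;
    private readonly Dictionary<int, Order> cache = new Dictionary<int, Order>();
    private readonly List<Order> pending = new List<Order>();
    private DateTime lastSync;

    public OperatorSession(IOrderService service, string region)
    {
        this.service = service;
        this.region = region;
        foreach (Order o in service.GetOpenOrders(region)) // one trip per shift
            cache[o.OrderId] = o;
        lastSync = DateTime.UtcNow;
    }

    // Answered from the local cache: no round trip during the call.
    public List<Order> FindOpenOrders(string customerName)
    {
        List<Order> result = new List<Order>();
        foreach (Order o in cache.Values)
            if (o.IsOpen && o.CustomerName == customerName)
                result.Add(o);
        return result;
    }

    public void RecordChange(Order changed)
    {
        cache[changed.OrderId] = changed;
        pending.Add(changed);
    }

    // One two-way trip when the call is completed.
    public void EndCall()
    {
        List<Order> remote = service.Sync(region, new List<Order>(pending), lastSync);
        foreach (Order o in remote)
            cache[o.OrderId] = o;
        pending.Clear();
        lastSync = DateTime.UtcNow;
    }
}

// In-memory stand-in for the web service; counts round trips for the demo.
public class FakeOrderService : IOrderService
{
    public int RoundTrips;

    public List<Order> GetOpenOrders(string region)
    {
        RoundTrips++;
        Order o = new Order();
        o.OrderId = 1; o.CustomerName = "Smith"; o.IsOpen = true;
        List<Order> orders = new List<Order>();
        orders.Add(o);
        return orders;
    }

    public List<Order> Sync(string region, List<Order> localChanges, DateTime lastSync)
    {
        RoundTrips++;
        return new List<Order>();
    }
}
```

With this shape, a whole phone call costs exactly one round trip (the Sync at the end), no matter how many lookups the operator does while the customer is on the line.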
These are just a few reasons why...to read more, pick up "Effective C#: 50 Specific Ways to Improve Your C#" by Bill Wagner.
You can buy it at the Addison-Wesley website:
http://www.aw-bc.com/
Callbacks are used to provide feedback from a server to a client asynchronously. They might involve multithreading, or they might simply provide an entry point for synchronous updates. Callbacks are expressed using delegates in the C# language.
Delegates provide type-safe callback definitions. Although the most common use of delegates is events, that should not be the only time you use this language feature. Delegates let you configure the target at runtime and notify multiple clients. A delegate is an object that contains a reference to a method, and every delegate you create can hold a list of targets (its invocation list). When performing multicast delegation with a return value, be sure to invoke each delegate target yourself: calling the multicast delegate directly gives you back only the return value of the last target. To examine the chain and call each one, iterate the invocation list like so:
public delegate bool ContinueProcessing();

public void LengthyOperation(ContinueProcessing pred)
{
    bool bContinue = true;

    foreach (ComplicatedClass cl in mobjContainer)
    {
        cl.DoLengthyOperation();
        //Ask every target in the chain, not just the last one
        foreach (ContinueProcessing pr in pred.GetInvocationList())
        {
            bContinue &= pr();
            if (!bContinue)
                return;
        }
    }
}
In the above example, I've defined the semantics so that each delegate must return true for the iteration to continue.
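If you want to see why the invocation list matters, here's a small self-contained sketch. The class MulticastDemo and its methods are names I made up for the demo. It compares calling a two-target delegate directly with walking its invocation list.

```csharp
using System;

public delegate bool ContinueProcessing();

public static class MulticastDemo
{
    public static bool SaysStop() { return false; }
    public static bool SaysGo() { return true; }

    // Invoking the multicast delegate directly: only the LAST target's
    // return value comes back, so an earlier "stop" is silently lost.
    public static bool InvokeDirectly(ContinueProcessing pred)
    {
        return pred();
    }

    // Walking the invocation list: every target is consulted in order,
    // and the first false ends the check.
    public static bool InvokeEach(ContinueProcessing pred)
    {
        foreach (ContinueProcessing pr in pred.GetInvocationList())
        {
            if (!pr())
                return false;
        }
        return true;
    }
}
```

Chain SaysStop and then SaysGo into one delegate: InvokeDirectly reports true because SaysGo ran last, while InvokeEach correctly reports false.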
These are just a few reasons why...to read more, pick up "Effective C#: 50 Specific Ways to Improve Your C#" by Bill Wagner.
You can buy it at the Addison-Wesley website:
http://www.aw-bc.com/
Writing constructors is often a repetitive task. Many developers write the first constructor and then copy and paste the code into another constructor to satisfy the multiple overloads defined in the class interface, or move the shared code into a common private method that every constructor calls. Here's an example:
public class Menu
{
    //private variables
    private string mstrName;
    private string[] marrItems;

    public Menu()
    {
        MenuConstructor("", 0);
    }

    public Menu(int intNumItems)
    {
        MenuConstructor("", intNumItems);
    }

    public Menu(string strName, int intNumItems)
    {
        MenuConstructor(strName, intNumItems);
    }

    private void MenuConstructor(string strName, int intNumItems)
    {
        marrItems = (intNumItems > 0) ? new string[intNumItems] : null;
        mstrName = strName;
    }
}
This is not a good idea. A better approach is constructor chaining, where each constructor calls another constructor through the this() initializer so that the initializing is done in one place. Here's a better approach:
public class Menu
{
    //private variables
    private string mstrName;
    private string[] marrItems;

    public Menu() : this("", 0)
    {
    }

    public Menu(int intNumItems) : this("", intNumItems)
    {
    }

    public Menu(string strName, int intNumItems)
    {
        marrItems = (intNumItems > 0) ? new string[intNumItems] : null;
        mstrName = strName;
    }
}
The second approach generates far more efficient code. In the first example, the compiler adds code to perform several functions on your behalf in every constructor: it adds statements for all variable initializers and calls the base class constructor, and only then calls the helper method...this is a very big difference! With chaining, that work is emitted only in the last constructor in the chain. Also consider readonly fields. By nature, they can only be set in a constructor (or an initializer), so a helper method cannot assign them at all. By centralizing this action in one constructor, we avoid more redundancy.
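The readonly point is easy to check for yourself. Here's a minimal sketch, assuming the fields of Menu are made readonly. The Name and Capacity properties are my own additions so the result can be inspected. With this() chaining the class compiles; if you move the two assignments into a private helper method, the compiler rejects them (error CS0191).

```csharp
using System;

public class Menu
{
    // readonly fields may only be assigned in a constructor or an
    // initializer, so a shared helper method could not set them.
    private readonly string name;
    private readonly string[] items;

    public Menu() : this("", 0) { }

    public Menu(int numItems) : this("", numItems) { }

    // The one constructor that does the real work.
    public Menu(string name, int numItems)
    {
        this.name = name;
        this.items = (numItems > 0) ? new string[numItems] : null;
    }

    // Added for the demo only, so callers can observe the state.
    public string Name { get { return name; } }
    public int Capacity { get { return items == null ? 0 : items.Length; } }
}
```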
These are just a few reasons why...to read more, pick up "Effective C#: 50 Specific Ways to Improve Your C#" by Bill Wagner.
You can buy it at the Addison-Wesley website:
http://www.aw-bc.com/
The C# foreach statement generates the best iteration code for any collection you have. Examine these three loops:
int[] foo = new int[100];
//Loop 1
foreach (int i in foo)
    Console.Write(i.ToString());
//Loop 2
for (int i = 0; i < foo.Length; i++)
    Console.Write(foo[i].ToString());
//Loop 3
int len = foo.Length;
for (int j = 0; j < len; j++)
    Console.Write(foo[j].ToString());
For the current and future C# compilers (version 1.1 and up), loop 1 is the best. It's even less typing so productivity is also better. Note: The C# 1.0 compiler produced much slower code for loop 1, so loop 2 is the best in that version.
Loop 3 is the worst. By moving the "Length" property out of the loop, you make a change that hinders the JIT compiler's chance to remove range checking inside the loop. The CLR guarantees that you cannot write code that overruns the memory your variables own, so the runtime generates a test of the actual array bounds (not the "len" variable) before accessing each particular array element. You are now forcing the runtime to check the array index on every iteration!
Loop 1 is better than Loop 2 because you allow the compiler to check the upper and lower bounds. Some people still believe index variables start at 1, not 0. Loop 2 forces you to know the lower bound, whereas Loop 1 does the work for you.
Custom objects/types: foreach allows you and your users to iterate across members if you support the .NET environment's rules for a collection.
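As a rough illustration of that last point, here's a minimal custom collection that works with foreach. The type name MenuItems is made up for the example, and it assumes C# 2.0 or later for generics and yield return.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical collection type that exposes its items to foreach.
public class MenuItems : IEnumerable<string>
{
    private readonly List<string> items = new List<string>();

    public void Add(string item) { items.Add(item); }

    // foreach calls this; 'yield return' has the compiler build
    // the enumerator for us.
    public IEnumerator<string> GetEnumerator()
    {
        foreach (string item in items)
            yield return item;
    }

    // Non-generic version required by the IEnumerable interface.
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}
```

Callers can then write foreach (string s in menuItems) without knowing anything about how the items are stored or what the bounds are.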
These are just a few reasons why...to read more, pick up "Effective C#: 50 Specific Ways to Improve Your C#" by Bill Wagner.
You can buy it at the Addison-Wesley website:
http://www.aw-bc.com/
If you're still creating public variables in your types, stop now. You should be using properties as they enable you to create an interface that acts like data access, but still has all of the benefits of a method. They also provide encapsulation, something you want as an object-oriented developer.
The .NET Framework assumes you'll use properties for your public data members. The data binding classes support properties but do not support public data members. For example:
txtLastName.DataBindings.Add("Text", Employee, "LastName");
This example binds the Text property of the "txtLastName" TextBox control to the "LastName" property of the "Employee" object.
As you already know, properties allow you to apply rules as to what data can be applied to your private variables. Properties allow you to apply those rules in one location...much easier to update in the future.
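Here's a small sketch of what "rules in one location" can look like. The validation rule itself (non-empty, trimmed) is just an assumption for the example; your rules will differ.

```csharp
using System;

public class Employee
{
    private string lastName = "";

    public string LastName
    {
        get { return lastName; }
        set
        {
            // The rule lives here and nowhere else: every caller,
            // including data binding, goes through this setter.
            if (value == null || value.Trim().Length == 0)
                throw new ArgumentException("Last name cannot be empty.", "value");
            lastName = value.Trim();
        }
    }
}
```

If the rule changes later (say, a maximum length), you update the setter and every caller picks it up. With a public variable you'd be hunting down every assignment instead.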
These are just a few reasons why...to read more, pick up "Effective C#: 50 Specific Ways to Improve Your C#" by Bill Wagner.
You can buy it at the Addison-Wesley website:
http://www.aw-bc.com/