InsideGoogle
Thursday, September 09, 2004
 
Finally, An Answer To "Is Google Broken?"
Google is not broken.

There has been much discussion on the subject of Google being broken. I've brought it up before, I've discussed the experts involved, we've even been visited by noted Google critic Daniel Brandt. Now, courtesy of the forums at Search Engine Watch and the good people at Search Engine Roundtable, we seem to have our answer.

The problem arose from the fact that the counter on Google's home page has read "Searching 4,285,199,774 web pages" for a little over a year now. Brandt, and later Anthony Federico, pointed out that this is most likely related to the fact that Google uses a 4-byte unsigned long integer to identify its pages, which tops out at 4,294,967,295 possible values. So, the theory went, Google ran out of ID numbers for its pages and could no longer index the whole web.

In fact, it seems, Google used one of the many workarounds available for the 4-byte ULI limit. A search on the word "the" returns 5,800,000,000 results, more than the figure on the Google home page. The home page still shows the old number because, with the workaround in place, the counter can no longer be tied directly to the database. As a result, the number is not real-time; it is the last count recorded before the update that pushed past the 4-byte ULI limit.
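To make the arithmetic behind the theory concrete, here is a minimal Python sketch of the 4-byte ceiling and the headroom left under it. The assumption that Google's DocIDs are 32-bit unsigned integers comes from the theory above, not from anything Google has confirmed.

# The 4-byte unsigned integer ceiling the theory hinges on. The numbers are
# plain arithmetic; how Google actually assigns DocIDs is the theory's
# assumption, not something Google has confirmed.
MAX_UINT32 = 2**32 - 1   # 4,294,967,295 -- largest value a 4-byte unsigned int can hold
MAX_UINT64 = 2**64 - 1   # the ceiling after the obvious "move to 8-byte IDs" workaround

home_page_count = 4_285_199_774          # the figure frozen on Google's home page
print(MAX_UINT32 - home_page_count)      # 9767521 -- IDs of headroom left under 32 bits
print(5_800_000_000 > MAX_UINT32)        # True -- the "the" result count exceeds the limit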

As "cariboo" explains in this thread:

This ID problem is an urban legend. A low-skilled techie could solve it without any problems, and Google has a big team of PhDs. In fact, Google has a special know-how for building huge, scalable systems with hundreds of machines working together... Gmail is their latest demonstration of this know-how.

And he says something that may explain why some sites are listed in Google but without page info. The gist: Google used to crawl the entire web in one big batch; since July 2003 a new incremental crawler handles most pages, recrawling the ones that are updated every so often, while the old batch crawler still sweeps the whole web. If your site doesn't have page info (and this is my theory), it could be because the batch crawler caught you a while ago but the incremental crawler has no record of you. Are there any constantly updated sites that show this problem? Of course, it could be a different kind of bug.
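As a rough illustration of that split, here is a speculative Python sketch of how pages might be routed between the two crawlers. The cutoff rule and the page fields are invented for this example; they are not taken from anything Google has published.

# Speculative sketch of the batch vs. incremental split described above.
# The routing rule and the page fields are invented for illustration only.
from datetime import datetime

INCREMENTAL_CUTOFF = datetime(2003, 7, 1)   # when the new crawler reportedly went live

def pick_crawler(page):
    # Pages known to have changed since the cutoff go to the incremental
    # crawler and get full page info; everything else stays with the old
    # batch crawl and may show up as a URL-only ("empty") result.
    if page["last_modified"] >= INCREMENTAL_CUTOFF:
        return "incremental"
    return "batch"

page = {"url": "http://example.com/", "last_modified": datetime(2004, 9, 1)}
print(pick_crawler(page))   # incremental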

There's your answer, folks. Hopefully this will satisfy most people.

Comments:
Fascinating!
 
Keep in mind, the number is rounded off. There aren't exactly 5.76 billion pages with "the" in them. Google always rounds off numbers that high.
 
You say the article "Is Google Broken?" has no evidence, but I have to completely disagree with you on that. I don't know if you have been following the many posts on the article at w3reports.com or not, but Anthony Federico has provided some compelling supporting material, written by the Google founders, that can be downloaded directly from stanford.edu! Not only does he support this 4-byte limit, but he has uncovered a huge amount of evidence on several other Google oddities that I myself have noticed over the past several months with my own website.

This guy has not only provided written support, but has also supplied examples of real search results from Google's website. I think Google should respond, don't you? Who cares if Google hit the limit or not? I want answers to all the other problems that Anthony Federico has so kindly documented for us. I don't know about you, but I'll be following this post closely. I also don't care who Anthony Federico is. If he were a rival of Google, I don't know why he didn't plug his own website in the articles. It appears to me he is not looking to promote anything.

http://www.livejournal.com/community/insidegoogle/28037.html?mode=reply



Granted, the Google publications he points to are old, but when was Windows XP released? Has it ever worked right? And it's backed by one of the largest companies in the world, with quite possibly the deepest pockets and most brilliant minds. Heck, I just installed another 75MB patch the other day.

Could everything have changed at Google while they still had the time to develop AdWords, AdSense, and Gmail? I don't know about that one.

It's not easy to re-engineer something, even something as simple as Google's search engine. Plus, if you go to Google's job postings you will find that some jobs link to these same reports as a prerequisite. So if they are outdated, I don't know why they would provide this information to future employees.

You say the sites he lists are in Google. I guess I don't know what you're reading, but maybe you missed the actual URLs he posted. If you do the searches the way he shows, you will see what he's talking about.

[google search term] site:www.liberty72.com

www.liberty72.com/L72_HT.html
Similar pages

www.liberty72.com/L72_contact.html
Similar pages

They are empty in the results at Google, just like he said. I did the same kind of search on my own site and found all kinds of empty pages at Google, not showing my titles or any content. This is very alarming to me and might explain why my traffic has decreased over the past 8 months.

On Daniel's pages, and in the post he made to the article, he does plug NameBase. He gives a link to http://www.google-watch.org/dying.html which is all about NameBase.

I can't agree with you more about being careful what you post. You have to do your research, that is for sure. But like you said, Anthony Federico did say Google may have changed this 4-byte stuff, but that doesn't change the evidence he presents or the other Google flaws.

I didn't realize this was a press release site. The heading on the pages says "NEWS FOR WEBMASTERS". Even the home page says it is a news article site. I now wonder if we are visiting the same site.

Anyway, thanks for your time,
John


I sit here reading all this and I have a hard time believing Google would be so stupid as to run out of space with this 4-byte integer thing. Although at the same time, how could a site like Google continue to display the same 4,285,199,774 figure on their home page for several months now without a change? What I find really compelling about your article is that Google was able to find the time to create a new Google logo every single day during the Olympics. I guess if they can make a stupid mistake like this, they can certainly have a problem with running out of space.

I also want to add that all the information you presented here is excellent. All the Google bugs you show make sense, and I have tested my own site and found the same kinds of problems. I have pages that were created back in 1998 that are content rich, with NO keyword hammering or any sign of spam, that were always in Google's index, and around February of this year Google, for no reason at all, started showing only the URLs in the results pages. I have contacted Google several times and they just give the same canned response stating that my pages have not been fully fetched. But this is complete nonsense, because for 6 years they were in Google and now they are empty pages.

I also check my server log files daily, and Google visits these pages sometimes twice in one day, and has done so for 6 months. It's not a firewall problem or something where we are blocking Google, because Google seems to find and index several other pages on the site. These pages have no duplicates either. Google is just really messed up, and I thank you for exposing these problems to the masses.

#####################################################

I think this 5,760,000,000 figure only proves Google is broken. Try this query:

-allintitle: +the

which means the word "the" is NOT supposed to be part of the TITLE. However, Google clearly returns pages with the word "the" in the TITLE. And to further complicate this query, we can try any one of these too:

search for allinurl: the = 5,760,000,000
search for allintitle: the = 5,760,000,000
search for allintext: the = 5,760,000,000

So the word "the" seems to be a bogus query, and so is the 5,760,000,000 result count it returns.

How can you write about Google when your own site isn't even listed in Google?

site:insidegoogle.blogspot.com

The article written at http://www.w3reports.com/index.php?itemid=549 says the DocID issue "could" be the reason why "Google is broken". But the article is not about the DocID, as so many Google lovers want to believe; it is about Google being broken, and it contains some startling evidence. The author of the article asks these questions:

1. Why, after several months, does Google still proudly display 4,285,199,774 web pages, yet they seem to have the time to update their logos on a daily basis?

2. Why are pages that are still valid and active dropped after being in Google's index for years?

3. Why does Google keep returning empty, URL-only results for the same URLs for months at a time?

4. Why do empty pages, which Google claims not to have fully indexed and which show up as empty results, rank higher than pages that have been indexed?

5. Why does Google crawl sites that are clearly restricted from robots?

6. Why does Google include URLs in their SERPs that, again, are restricted from all robots, including Google? This only hurts the quality of search results.

And we should add one more to this.

7. Why does Google not index pages at insidegoogle.blogspot.com? You can't say that no one is linking to you and that's why Google hasn't discovered these pages.

So far, none of the posts I've found on the Net defending Google answer any of the questions presented in this article, nor does anyone provide any documentation or concrete proof for their theories of why Google is not broken. Like this article:

http://www.searchguild.com/article215.html

which is complete nonsense. It says Google doesn't change the number of pages indexed on their home page because Google doesn't have to? That's got to be the stupidest answer I have ever heard. Plus, his 8-byte theory seems to have been shot down by a comment from a "Mr. Yuan" at:

http://www.w3reports.com/index.php?itemid=549

But let's just assume Google isn't broken and the DocID is not, and has never been, an issue. Now that this is out of the way, we can concentrate on the issues instead. Some so-called SEO experts defend Google because they are not smart enough to answer the questions why. They have no real theory of their own. Let's have a look at a so-called expert's site in Google:

site:searchguild.com

This query returns about 56,100 results. And as you can see, this will find anything on the site, including URLs that contain searchguild.com, right? And you'll notice all kinds of empty URLs, right? Let's just forget about the empty URLs and concentrate on the figure of 56,100. Let's change our search just a bit:

site:searchguild.com searchguild.com

This will return about 68,900 results. Now that doesn't make sense at all to me. How can this more restrictive query return more results than the plain site: query? Again, you can't rely on the total number of results shown by Google on the results pages, so put that 5.7 billion figure to rest.

I used to love Google prior to 2003. Somewhere between late 2002 and late 2003, Google fell apart. I'm sure this all has to do with their rewriting of PageRank, because PageRank was not owned by Google (it was owned by Stanford). In order for Google to go public they had to rewrite this algorithm, and in doing so, screwed up Google.

I feel people like Chris Ridings should be part of the movement to get answers instead of posting ridiculous content trying to defend his daddy Google. Even his own sites are affected by these Google problems, but instead he writes about how great daddy Google is. His site is supposed to be about search engine optimization, but he can't even get his home page indexed by Google:

site:chriseo.com

chriseo.com/
Similar pages

Now that's an SEO I want working for me. If instead he talked about how Google works and why his site is missing from Google's index, I might have a little more respect for the guy. As it is, I don't have respect for someone who is an SEO, claims Google is great, and at the same time can't get into Google's index. He should be writing about Google's broken index and how Google will destroy anyone without holding back. At this point I wouldn't hire Chris to do my laundry, let alone be my SEO.
 
Wow, that is quite the long post, but I will attempt to answer everything you said, if possible.

Google should respond to Anthony's questions as to why the front page shows the same number, and why some URLs contain no page info. In the absence of a response from Google, the user community can try to provide the answers.

The reason for both is the same, and yes, it has to do with the DocID number. Google's front-page number is not hand-edited. It is a non-real-time, computer-generated assessment of the database. The DocID problem prevents Google from counting all of its listings. Google has a lot more pages indexed; it just can't count them.

The reason for the "empty" results is Google's new index, built by the incremental crawler. The incremental crawler only crawls pages that Google knows to have been updated since the crawler went online in July 2003. Most pages that have not been updated since then have no page info. Check it out. They have only been crawled by Google's original batch crawler, and they remain in the database but are acknowledged to be "old". Be thankful Google held on to the old results, when it could have dropped and forgotten them.

Once again, database constraints can mean Google has indexed more of the web than it can report at one time. 5,760,000,000 pages may be the upper limit of results Google can display, not the upper limit of pages indexed. We just don't know.

insidegoogle.blogspot.com is relatively new. Google is discovering more of its pages daily as it crawls the site. I run a twice-daily search, and I've seen the results grow every time.

1. Google cannot count its own database. The database cannot count itself. Any number would be an estimate, and Google doesn't want to sound like it's making things up, so it goes with the last number it can prove.

2. Valid pages are not dropped. They are just stuck in the old index due to not being updated.

4. Google has indexed those pages and given them high page ranks in the old index. PageRanks are still honored by the new index.

5. Some robots.txt files don't work right.

7. I have fewer inbound links than I would like. A four-week-old blog will take time to be properly indexed. That's fine with me, as long as they get to it eventually. New post pages on sites Google indexes constantly still take a few days to be indexed. It's a known phenomenon.

Any more questions, I will be glad to try to answer.
 
Nathan, you're starting to sound like Chris Ridings by not providing answers. Instead, you are throwing a smokescreen over the real problems at Google. In reality, you are providing more documentation to prove the "Is Google Broken" theories floating around out there.

You said:
1. Google cannot count its own database. The database cannot count itself. Any number would be an estimate, and Google doesn't want to sound like it's making things up, so it goes with the last number it can prove.

So you're trying to tell me that Google can't issue an SQL statement such as:

select count(*) from URLS where status = 200;

The above statement would "count" all documents Google has in their database that have a status of 200 (valid, crawled, and searchable). If they can count to 4,285,199,774, why can't they count to 5,760,000,000? Yet they can count to 5,760,000,000 when I issue a query of +the. So Google can't count, and because they can't count, they lie. This, I think, is more ridiculous than Chris's statement. Why would Google lie to us?

2. Valid pages are not dropped. They are just stuck in the old index due to not being updated.

So you're saying that when I enter a search term at Google, Google not only sends the request to their old broken index, but also to their new index (assuming, of course, they have two)? I guess this would explain the "about 5,760,000,000" when issuing a query for +the. This would include their 4,285,199,774 documents then, wouldn't it? So the index is still only 4,285,199,774 documents, and the other 1.5 billion is from a secondary index, which could be the one returning the empty pages. Again, this shows Google is broken.

4. Google has indexed those pages and given them high page ranks in the old index. PageRanks are still honored by the new index.

So you are indeed confirming that Google must have hit this 4.2 billion record limit, and that this is the reason it now has two indexes. OK, let's go with that then. Can you now tell me how PageRank and content relevancy would be applied, on one query, across two separate indexes? If this were true, PageRank would be even more flawed than has been said. Again, this shows Google is broken.

5. Some robots.txt files don't work right.

So you're saying HotBot, which has been around longer than Google, doesn't know how to create a robots.txt file? It's just a text file, and the syntax is perfect.

http://www.hotbot.com/robots.txt

# No robot will spider the domain
User-agent: *
Disallow: /

Are you telling me this syntax is incorrect? Can you not view this file with your browser? YES, the syntax is perfectly fine and you can view it with your browser! But Google can't, now can it? Again, this confirms one of the "Is Google Broken" theories.
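For what it's worth, you can feed those exact rules into Python's standard robots.txt parser and see that any compliant crawler is told to stay out. This only shows the syntax is valid; it says nothing about what Googlebot actually did.

# Run the HotBot rules quoted above through Python's standard robots.txt parser
# and ask whether a crawler may fetch the site. This demonstrates the file is
# well-formed; it does not reproduce Googlebot's behavior.
from urllib import robotparser

rules = [
    "# No robot will spider the domain",
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "http://www.hotbot.com/"))       # False
print(rp.can_fetch("*", "http://www.hotbot.com/anything.html"))  # False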

7. I have fewer inbound links than I would like. A four-week-old blog will take time to be properly indexed. That's fine with me, as long as they get to it eventually. New post pages on sites Google indexes constantly still take a few days to be indexed. It's a known phenomenon.

Well, I can't comment on the popularity of your site; I only included that query as a reference. But it doesn't answer the many posts all over the Internet regarding pages being dropped for over a year now. Instead of demanding answers like a journalist would, you call it a phenomenon. Now that's good journalism.

In my opinion you've only reinforced the "Is Google Broken" theory, and I thank you for that. But defending these quirks as a phenomenon, and as OK, will only come back to haunt you, I think. I used to be a Google lover too, and often made the same kinds of statements to others when they complained about Google. But one day I opened my eyes, got out of that Google bubble, and found these people were not exaggerating. Their pages were being dropped, pages excluded by robots.txt were being indexed, empty pages do continually show up for months at a time, PageRank is being stolen, and so on.

I am no longer a Google lemming. Now I look at hard facts and real site-owner examples. When I did this, the real Google was finally exposed to me, and it would be for you too if you stopped calling it a phenomenon and went out and did your own research like a journalist should.
 
To further explain:

Google is more than just a search engine; it is an operating system with a unique file structure that spans a huge number of computers and applications. In July of 2003, Google upgraded to a newer version of the file system. Rather than cripple the database with backward compatibility, Google made it so the new database did not work with the old system, but instead could use the old system for queries. The new system searches its own file system, then searches the old one, and combines the two into a single search engine results page. This allows the new database to be more comprehensive while losing no data that the previous database held.

The only thing that the new database accesses from the old one is PageRank data and associated backlinks and keywords. The new database includes only pages updated after July 2003. If you have not updated your page, then that is most likely why you are getting blank results.

Google cannot count its own database, presumably because it is not a single database. Simply counting all the entries would not do it, because so many pages have multiple entries. Some pages are in both databases, others are in only one, and presumably Google was smart enough this time not to put all its eggs in one database. There could be 500 Google databases at this point, for all we know, containing so many duplicate entries, backups upon backups, versions upon versions, that counting it all is completely impossible.
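Purely as a sketch of the kind of merge being described here (the index layout and scores are invented for illustration; none of this is confirmed by Google), combining an old and a new index could look roughly like this in Python:

# Speculative sketch of merging hits from a new (incremental) index and an old
# (batch) index into one results page, collapsing URLs that appear in both.
def merged_results(query, new_index, old_index, limit=10):
    seen = set()
    merged = []
    for index in (new_index, old_index):        # the newer copy of a URL wins
        for url, score in index.get(query, []):
            if url in seen:
                continue
            seen.add(url)
            merged.append((url, score))
    merged.sort(key=lambda hit: hit[1], reverse=True)   # one combined ranking
    return merged[:limit]

# Toy data standing in for the two crawls.
new_index = {"the": [("http://example.com/new.html", 0.9)]}
old_index = {"the": [("http://example.com/new.html", 0.5),    # superseded copy
                     ("http://example.com/old.html", 0.7)]}   # only in the old crawl
print(merged_results("the", new_index, old_index))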

I can't faithfully answer your robots.txt question, not knowing all the details.

The "phenomenon" of a page taking a few days to be indexed is perfectly normal and acceptable, in my view. While I would like Google to notice everything the moment I write it, that's probably not going to happen in the real world. Google News and RSS notice brand new things, and until we get Google Blogs, that is going to have to be good enough.

And let's be honest: like any operating system, there are going to be bugs. At least Google is more stable and more user-friendly than anything else out there.
 
Nathan says:

"Google is more than just a search engine, it is an operating system with a unique file structure that spans a huge number of computers and applications. In July of 2003, Google upgraded to a newer version of the file system. Rather than cripple the database with backward compatability, Google made it so the new database did not work with the old system, but instead could use the old system for queries. The new system searches its own file system, then searches the old one, and combines the two into a single search engine results page. This allows the new database to be more comprehensive while losing no data that the previous database held.

"The only thing that the new database accesses from the old one is PageRank data and associated backlinks and keywords. The new database includes only pages updated after July 2003. If you have not updated your page, then that is most likely why you are getting blank results."

___________

Do you have a source for this? Preferably from the Googleplex? I follow Google rather closely, and this is the first time I've read about a new database that began in July 2003. What relationship does this have with "Supplemental Results," if any?

Finally, how can Google justify two databases except as a transitional solution? And if that is how they justify it, why, after 14 months, do we see further deterioration instead of progress?

-- Daniel Brandt
 
Read this post:
http://forums.searchenginewatch.com/showpost.php?p=12722&postcount=6

So much of the information on Google is speculation. It is always difficult to get information on a company's technology. However, there were changes of some sort to the Googlebot around a year ago. Google has always had a FreshBot and a Deepbot, and it is possible that it has switched engines. Some of what I said is speculation built on what others have uncovered.

I feel more comfortable speculating about what solutions were used to fix problems we can reasonably be sure were fixed than speculating about problems we cannot be sure exist.
 
The only thing of substance in the link you offer that relates to a new index started by Google in the summer of 2003 is a 2003-09-03 piece by Danny Sullivan at: http://searchenginewatch.com/searchday/article.php/3071371

I agree with everything Danny says in that piece. Namely, a supplemental index is a horrible idea, you can't trust Google's numbers, and Google basically has nothing to say about why this separate index was started, except for this worthless comment: "The supplemental is simply a new Google experiment. As you know we're always trying new and different ways to provide high quality search results," said Google spokesperson Nate Tyler.

So I believe the answer to my question is that you don't have any information that I haven't already been contemplating for the last 12 to 17 months.
 