Inverse Document Frequency In Plain English – Dr E. Garcia

Author: Gab Goldenberg

[Image caption: SEW Moderator Orion]

I recently had the opportunity to discuss some information retrieval theory with Dr. Edel Garcia, who is a researcher and professor in the field, as well as a longtime SEW Forum moderator (aka Orion). He helped me understand what Inverse Document Frequency (IDF) is and what it has been used for in the information retrieval field. Dr. Garcia generously agreed to let me publish our conversations on this blog.

(Update: Apparently Dr. Garcia is no longer active at SEW as a moderator.)

Inverse Document Frequency isn’t a measure of a page’s relevance, but it can [potentially] be valuable for keyword research. That said, Dr. Garcia recently presented a paper suggesting that an adaptation of IDF may be more valuable for keyword research than IDF itself.

Any grammar, spelling or other mistakes are my own.

Dr Garcia:

“Thank you for taking the time to email me. I hope this helps.

1. “I recall a point you made about keyword density being a silly metric, since it’s only valuable relative to the total index.” [Citing my email]

The first part of this statement is correct, but I don’t recall ever claiming the second part; i.e., that KWD is valuable relative to the total index.

KWD is not a collection-wide statistic, but a local one, computed at the level of individual documents only.

2. “I was just thinking – suppose you looked at the keyword density of pages outside the top 100, on non-competitive terms.

That is, scraping results 100 – 200 and then measuring the keyword density across all those pages, as your representative sample.

(The reason for going outside the top 100, and on a non-competitive term, is to find non-SEOed content, because it [SEOed content] would have more keywords than average.)”

Sure, you can try to find collection-wide statistical averages of any kind … but why use a metric that is not used in IR for trying things [that are IR related], and which, on top of that, does not attenuate term frequencies and lacks relevance information?

It makes more sense to use one of the many local weight IR models for this purpose, like

1 + log(f)
[or]
log(1 + f)

etc. Or use normalized versions of these, or entropy-based versions of these. [Ed: Entropy is a measure of how uncertain a piece of information is, per Wikipedia.]

These models have the effect of scoring local weights and dampening down term repetition abuses (keyword spam).

Better, by using these, your results are closer to what is measured by an IR system. These can then be combined with collection-wide weights, in the presence or absence of relevance information (IDF, IDFP, entropy, RSJ-PM, Okapi, etc.).
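
[Ed: Below is a rough Python sketch, mine and not Dr. Garcia's, of the local weights he names (1 + log(f) and log(1 + f)) next to raw keyword density. The 500-word document length is a made-up number; the point is simply that the log-based weights grow slowly, dampening term repetition.]

import math

def raw_kwd(term_count, doc_length):
    # Keyword density: occurrences of the term divided by total words in the document.
    return term_count / doc_length

def local_weight_1(term_count):
    # 1 + log(f) for f > 0; zero when the term is absent.
    return 1 + math.log(term_count) if term_count > 0 else 0.0

def local_weight_2(term_count):
    # log(1 + f); also dampens repetition.
    return math.log(1 + term_count)

# Repeating a term 50x instead of 1x multiplies KWD by 50,
# but the log-based weights grow only slightly (damping keyword spam).
for f in (1, 5, 50):
    print(f, raw_kwd(f, 500), round(local_weight_1(f), 2), round(local_weight_2(f), 2))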

3. “That would then give you an idea of the average keyword density across the ‘index’, which you could perhaps apply to your own site? Or else you could look at the median number of times a keyword appeared in these non-optimized, ‘natural’ pages and use that?”

It is hard to see a statistical justification for doing this at all.

Cheers,

Dr. E. Garcia
site: http://www.miislita.com [Features a lot of valuable research, for those who can handle the technical jargon.]
blog: http://irthoughts.wordpress.com
newsletter: http://www.miislita.com/irw/ir-watch.html

Gab:

Hi Dr Garcia,

That helps broadly, but as with most of the content on your site, the jargon is way over my head so you lost me at point 2.

What I meant about KWD being valuable relative to the index is perhaps what you refer to as inverse document frequency.

In plain English, I understand IDF to mean that the less a word is used (up to a point), the more likely it helps describe that document’s content. And so, terms that are infrequently used across the collection are probably keywords (as opposed to stop words), and therefore worth paying attention to if you’re a search engine.

My point being that if you can figure out what range is “infrequent use” in a search engine’s collection, then you can edit your content to ‘infrequently use’ the keywords. Am I making sense?

Besides that, as a friendly tip, your content is evidently very erudite and researched, but it’s quite inaccessible to laymen. As you claim to want to help bust SEO myths and other silliness, you might be more effective (both directly in your communication and indirectly by gaining more links…) if you translated your material into simple English.

Best,
Gab

Dr Garcia:

Thank you again for taking the time to email me.

1. KWD is not IDF. KWD is just the frequency, or number of times a term is present in a document, divided by the total number of words in that document.

This is why it is a local weight [i.e. as opposed to "global," which in information retrieval refers to your whole index aka collection of documents]. It is also not a log.

2. IDF is the log of the total number of documents in a collection divided by the number of documents containing a particular term, regardless of how many times it appears in a document; i.e., IDF = log(D/di), where D is the total number of documents and di is the number of documents containing a given term ‘i’ [counting each document only once, even if a keyword appears in it multiple times].

[Emphasis added.]

3. IDF and KWD do not incorporate relevance information.

So, for all of the above, KWD is not IDF. They also measure different things. IDF is a measure of the discriminatory power of a term in a collection.
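
[Ed: A minimal Python sketch of the two definitions above, using a made-up toy collection. KWD looks only inside a single document, while IDF only counts how many documents contain the term at least once (base-10 logs, as in Dr. Garcia's example further below).]

import math

def keyword_density(term, document_words):
    # Local statistic: occurrences of the term in one document / total words in that document.
    return document_words.count(term) / len(document_words)

def inverse_document_frequency(term, collection):
    # Collection-wide statistic: log of (total docs / docs containing the term at least once).
    docs_with_term = sum(1 for doc in collection if term in doc)
    return math.log10(len(collection) / docs_with_term)

# Toy collection of tokenized "documents" (made up for illustration).
collection = [
    ["search", "marketing", "tips"],
    ["search", "engines", "index", "documents"],
    ["marketing", "news"],
    ["cooking", "recipes"],
]

print(keyword_density("search", collection[0]))          # 1/3 -- says nothing about the collection
print(inverse_document_frequency("search", collection))  # log10(4/2) -- says nothing about repetition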

IDF was useful with early search engines and IR systems. But because large Web search engines are too generic, it has slowly been phased out in favor of models that incorporate relevance information and are more stable as the size of the collection grows.

In the early IR literature, IDF was mistaken for a measure of term importance. Terms important to average users tend to have high search volume, regardless of their IDF values.

By contrast, very rare terms might have high IDF, but that does not necessarily make them important.

In fact, very rare terms are more likely to have low search volume simply because they are rare. Average users are not prone to query very rare terms at all.

If you want to find useful terms from past the top 100 or 200 results, I would suggest identifying those within that window that have high search volume instead. If you do that, you don’t need to worry about KWD or IDF.

With regard to my writing style, my target audience is graduate students and faculty colleagues, more than marketers.

Cheers,

Dr. E. Garcia
email: admin@miislita.com
site: http://www.miislita.com
blog: http://irthoughts.wordpress.com
newsletter: http://www.miislita.com/irw/ir-watch.html

Gab:

Hi Dr. Garcia,

Heehee, I like your tips and explanation on why it works. Matter of fact, I have used that technique before and continue to advocate it, as I did at SMX West as a Q&A mod on the KW research panel :).

Cheers
Gab

Dr. Garcia:

Hi, there:

Here is a simple procedure for mining the Web for useful terms:

1. Query a search engine for a specific topic. The query can be a single term or several terms.
2. Collect the top N ranked titles (or entire documents).
3. Remove stop terms.
4. Remove duplicate terms.
5. Evaluate their search volumes with keyword query tools.
6. Sort these in decreasing order of search volume.

(Try also sorting combinations of terms, i.e., phrases.) [Ed: a rough code sketch of this procedure appears at the end of this email.]

Note: The top N should belong to a topic in common (the so-called Cluster Hypothesis). [I didn't follow up on what that is.]

However, if the terms in your initial query are too ambiguous, then as you move down to larger N values you may find that past the initial top N there are topic transitions.

[Translation: If you use ambiguous keywords to start with, the further you go into the search results, the greater the chances you'll find keywords on a variety of topics.]

This also occurs when queries consist of combinations of terms that are too semantically disconnected (e.g., try “aloha hawaii” versus “aloha anystate”, where anystate is any other state of the nation). I believe I covered this in an old Search Engine Watch Forum [post?] back in 2004-2006 or so.
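
[Ed: Here is my own rough Python sketch of the six-step procedure above. fetch_top_titles and get_search_volume are placeholders for whatever search-results source and keyword-volume tool you use; they are assumptions for illustration, not real APIs.]

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is"}

def mine_terms(query, top_n, fetch_top_titles, get_search_volume):
    # 1-2. Query a search engine for the topic and collect the top N ranked titles.
    titles = fetch_top_titles(query, top_n)
    terms = set()
    for title in titles:
        for term in title.lower().split():
            if term not in STOP_WORDS:      # 3. remove stop terms
                terms.add(term)             # 4. the set removes duplicate terms
    # 5. Evaluate search volumes with whatever keyword tool you use.
    volumes = {term: get_search_volume(term) for term in terms}
    # 6. Sort in decreasing order of search volume.
    return sorted(volumes.items(), key=lambda item: item[1], reverse=True)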

Cheers,

Dr. E. Garcia
email: admin@miislita.com
site: http://www.miislita.com
blog: http://irthoughts.wordpress.com

Gab:

Thank YOU for taking the time to respond.

Also, I appreciate you keeping things at a level I can understand, since I’m sure that is an extra effort, given your main target audience.

Actually I found that very helpful, and was wondering if you’d mind my sharing that as a post on my blog?

Cheers
Gab

Dr. Garcia:

Hi, there:

1. Feel free to publish.

2. Glad you have figured that out. I spoke on that at SES NY 2005 in a remote presentation on co-occurrence theory.

3. I no longer speak at those events, although they were a lot of fun. To be honest, I realized those conferences are full of speakers promoting a lot of nonsense and SEO myths/hearsay/their own crappy ideas, etc., which is a real shame. I have since concentrated on academic research events.

Cheers,

Dr. E. Garcia
email: admin@miislita.com
site: http://www.miislita.com
blog: http://irthoughts.wordpress.com
newsletter: http://www.miislita.com/irw/ir-watch.html

Dr. Garcia:

“Bit of a shame you no longer attend search shows just cuz some goofs don’t know what they’re saying. Doesn’t the free debate of ideas lead to the truth?

I’m thinking your presence would help minimize the spread of said crappy ideas, hehe!” [Citing my email]

My speaking engagement policies are clearly outlined on my site; i.e., I speak for free at not-for-profit events.

However, I don’t speak for free at for-profit events. The only three for-profit marketing conferences I have attended outside Puerto Rico were ones where the organizers generously supported my presentations (travel and hotel). These were Jupitermedia’s SES (New York 2005 and San Jose 2005) and the OJOBuscador Congress, Madrid, Spain, 2007.

I prefer to spend the limited time I have teaching graduate courses on IR and mentoring graduate students on their thesis work.

Between these duties, writing my IR Newsletter, acting as a peer reviewer for scientific journals, and serving as a program committee member of W3C’s AIRWeb Workshops, I simply don’t have the time for marketing conferences, nor do I enjoy them when the presenting “experts” are not so.

As for debating ideas in the open, that’s why I have the IR Thoughts blog: to debunk SEO Snakeoil. The blog is also used to contact students taking my graduate courses.

BTW: I am teaching a new graduate course on Web Spam wherein SEO Snakeoil Myths and case studies of these will be dissected. The syllabus is announced at
http://irthoughts.wordpress.com/2009/04/02/airweb-course-announcement/

Talking about SEO Snakeoil, I recently had to expose some flawed arguments in a Stompernet video regarding LSI.

They tried to debunk LSI, but the arguments used were completely flawed and did more harm than the myth they tried to debunk.

As I mentioned to the creator of the video, before trying to debunk something we at least need to know (understand) what exactly it is that we’re trying to debunk. We cannot just debunk hearsay with more hearsay, as they did. Now, that is a shame. Two wrongs do not make a right.

Check here:

http://irthoughts.wordpress.com/2009/04/09/finally-seos-are-getting-the-lsi-myth/

Cheers,

Dr. E. Garcia
email: admin@miislita.com
site: http://www.miislita.com
blog: http://irthoughts.wordpress.com
newsletter: http://www.miislita.com/irw/ir-watch.html

[My next email to Dr. Garcia just had some questions.]

Dr. Garcia: Questions 1 and 3 are related, so I will answer both now. After that, I will answer question 2.

“1) Why is a log function used for calculating IDF?

“3) Would it be accurate to describe IDF as ‘the ratio of documents in a collection to documents in that collection with a given term’? I’m guessing your answer would be, IDF is the [LOG of " the ratio of documents in a collection to documents in that collection with a given term"]? Which brings us back to question 1, I guess? hehe”

These are recurrent questions that students have asked me before. The reason for using logs is due to two assumptions frequently made in most IR models, i.e.:

I. Scoring functions are additive.
II. Terms are independent.

While in some models assumption II might not be present, both I and II play well with logs, since logs are also additive.

These functions, and why logs are used, are explained in the recent RSJ-PM tutorial:

http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf

Document Frequency (DF) is defined as d/D, where d is the number of documents containing a given term and D is the size of the collection of documents. If we take logs, we obtain log(d/D).

But since typically D > d, the log of d/D, that is log(d/D), gives a negative value. To get rid of the negative sign, we simply invert the ratio inside the log expression.

Essentially we are compressing the scale of values so that very large or very small quantities are smoothly compared.

Now log(D/d) is conveniently called Inverse Document Frequency.

Now, going back to d/D: this is a probability estimate p that a given event has occurred. Let the presence of a term in a document be that event. If terms are independent, it follows that for any two events, A and B,

p(AB) = p(A)p(B).

Taking logs we can write

log[p(AB)] = log[p(A)]+ log[p(B)]

It is easy to show that for two terms

log(d12/D) = log(d1/D) + log(d2/D)

[Ed: This isn’t 12 in the sense of 6×2. See the explanation just below, marked as an update.]

Inverting and using the definition of IDF we end up with

IDF12 = IDF1 + IDF2

validating assumption I; that IDF as a scoring function is additive.

That is, the IDF of a two-term query is the sum of the individual IDF values. However, this is only valid if terms are independent of one another. If terms are not independent, we have two possibilities; i.e.,

p(AB) > p(A)p(B)

or

p(AB) < p(A)p(B)

and we cannot say that the IDF of a two-term query (e.g., a phrase) is the sum of the individual IDF values. Assuming the contrary, as many SEOs do, is plain snakeoil.

Update/Clarification: I asked for a clarification on all this, and Dr Garcia again generously shared his knowledge:

“The ‘1’, ‘2’, and ‘12’ above are mere subscripts, not values.

So IDF1 stands for the IDF of term 1, IDF2 for the IDF of term 2, and IDF12 for the IDF of terms 1 and 2. An example follows.

A collection consists of 100 docs (D = 100).

20 docs mention t1=search, so IDF1 = log(100/20) or about 0.7
40 docs mention t2=marketing, so IDF2 = log(100/40) or about 0.4

What would be the IDF of a combination of terms like search marketing?

If we assume that terms are independent IDF12 should be the sum. I.e. IDF12= IDF1 + IDF2 = 0.7 + 0.4 or about 1.1.

This sum corresponds to 8 of the 100 documents mentioning both terms (since log(100/8) is about 1.1).

If we find out that more than 8 of the 100 docs mention both terms, the terms are not independent of one another, but actually positively correlated.

For example, if 15 docs mention both terms, IDF12 will not be 1.1 but actually log(100/15), or about 0.8.

In the reverse scenario, if fewer than 8 of the 100 docs actually mention both terms, the terms are negatively correlated, and still not independent.

Consequently, the exercise made by SEO keyword researchers of adding individual terms’ IDFs to come up with some sort of IDF for phrases is flawed by all accounts.”

[Ed: It just occurred to me that at any rate, you'd need to know the total size of the index in order to calculate IDF. AFAIK, that # isn’t public.]
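
[Ed: A quick Python check of the arithmetic in Dr. Garcia's example, using base-10 logs. It confirms that the sum of the individual IDFs (about 1.1) equals the IDF of the pair only when exactly 8 of the 100 documents contain both terms, i.e., when the terms are independent.]

import math

D = 100                      # total documents in the toy collection
d1, d2 = 20, 40              # docs mentioning "search" and "marketing" respectively

idf1 = math.log10(D / d1)    # ~0.70
idf2 = math.log10(D / d2)    # ~0.40

# Under independence, the expected number of docs containing both terms:
d12_expected = D * (d1 / D) * (d2 / D)          # = 8
idf12_if_independent = idf1 + idf2              # ~1.10, which equals log10(100/8)

# If 15 docs actually contain both terms, the terms are positively correlated
# and the real IDF of the pair is lower than the sum:
idf12_actual = math.log10(D / 15)               # ~0.82

print(round(idf1, 2), round(idf2, 2), d12_expected,
      round(idf12_if_independent, 2), round(idf12_actual, 2))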

2) What do you mean by ‘discriminatory power’ in the phrase “IDF is a measure of the discriminatory power of a term in a collection.”

This is a legacy idea from Robertson and Sparck Jones. The discriminatory power of a term (aka term specificity) implies that terms used too frequently are not good discriminators between documents.

If a term is used in too many documents its use to discriminate between documents is poor. By contrast, rare terms are assumed to be good discriminators since they appear in few documents.
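
[Ed: A small, made-up illustration of discriminatory power: the more documents a term appears in, the closer its IDF gets to zero; the rarer the term, the higher its IDF, even though, as noted above, very rare terms usually have low search volume.]

import math

D = 1_000_000                        # size of a made-up collection

for docs_with_term in (900_000, 100_000, 1_000, 10):
    idf = math.log10(D / docs_with_term)
    print(docs_with_term, round(idf, 2))

# A term in 900,000 of 1,000,000 docs scores ~0.05 (a poor discriminator);
# a term in only 10 docs scores 5.0 (a strong discriminator, but likely rarely queried).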

One particular research project I am working on contradicts the whole notion of defining the discriminatory power of a term using IDF. I cannot comment too much on it; however, I presented preliminary results last month at a local scientific conference. The abstract is at

http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Cheers,

Dr. E. Garcia
email: admin@miislita.com
site: http://www.miislita.com
blog: http://irthoughts.wordpress.com
newsletter: http://www.miislita.com/irw/ir-watch.html

Gab: p.s. Read also Dr Garcia’s post recapping the SIDIM conference and sharing an explanation of IDF and “vector space models.”

p.p.s. If you liked this post, you’ll likely enjoy some of the other interviews in my related posts section, especially this one with Ruud Hein: “How I Stay Ahead of the SEO Pack”. It’s probably also a good idea to add my rss feed to your reader!



Comments

  1. Thanks for the post. For the record, I don't moderate at SEW since mid 2006.

    Comment by egarcia - June 11, 2009 @ 3:56pm
  2. Oh, I wasn't aware of that, Dr. Garcia. The screenshot was taken yesterday...

    Comment by Gabriel Goldenberg - June 11, 2009 @ 11:32pm
  3. Finally, something I can really sink my teeth into. That was fantastic, thank you Dr. Garcia and of course you too Gab.

    Comment by Jeffrey Smith - June 16, 2009 @ 10:10pm
  4. Love to hear that Jeff. Thanks for the kind comment :)

    Comment by Gabriel Goldenberg - June 16, 2009 @ 11:52pm
  5. So it came out... that IDF does NOT matter for SEO relevancy. Nice! And I just thought I'd missed something and started reading and reading, and then I found your blog and was enlightened - I didn't miss anything! Thank you!

    Comment by peter - July 6, 2009 @ 9:57pm
