Inverse Document Frequency In Plain English – Dr E. Garcia

I recently had the opportunity to discuss some information retrieval theory with Dr. Edel Garcia, who is a researcher and professor in the field, as well as a longtime SEW Forum moderator (aka Orion). He helped me understand what Inverse Document Frequency (IDF) is, and what it has been used for in the information retrieval field. Dr. Garcia generously agreed to let me publish our conversations on this blog.

(Update: Apparently Dr. Garcia is no longer active at SEW as a moderator.)

Inverse Document Frequency isn’t a measure of a page’s relevance, but can [potentially] be valuable for keyword research. However, Dr. Garcia recently presented a paper suggesting that an adaptation of IDF may be more valuable to keyword research.

Any grammar, spelling or other mistakes are my own.

Dr. Garcia:

“Thank you for taking the time to email me. I hope this helps.

1. “I recall a point you made about keyword density being a silly metric, since it’s only valuable relative to the total index.” [Citing my email]

The first part of this statement is correct, but I don’t recall ever claiming the second part; i.e., that KWD is valuable relative to the total index.

KWD is not a collection-wide statistic, but a local one and at the level of individual documents only.

2. “I was just thinking – suppose you looked at the keyword density of pages outside the top 100, on non-competitive terms.

That is, scraping results 100 – 200 and then measuring the keyword density across all those pages, as your representative sample.

(The reason for going outside the top 100, and on a non-competitive term, is to find non-SEOed content, because it [SEOed content] would have more keywords than average.)”

Sure, you can try to find collection-wide statistical averages of any kind … but why use a metric that is not used in IR for trying things [that are IR related], and which, on top of that, does not attenuate term frequencies and lacks relevance information?

It makes more sense to use one of the many local weight IR models for this purpose, like

1 + log(f)
[or]
log(1 + f)

etc. Or use normalized versions of these or entropy-based versions of these. [Ed: Entropy is a measure of how uncertain a piece of information is, per Wiki.]

These models have the effect of scoring local weights and dampening down term repetition abuses (keyword spam).
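
[Ed: A minimal Python sketch of how these local weights dampen repetition compared with raw keyword density. The base-10 logs, the zero weight for an absent term, and the function names are my own assumptions for illustration, not something Dr. Garcia specified.]

```python
import math


def keyword_density(f, total_words):
    """Raw keyword density: term count divided by document length."""
    return f / total_words


def log_weight_a(f):
    """Local weight 1 + log(f); assumed to be 0 when the term is absent."""
    return 1 + math.log10(f) if f > 0 else 0.0


def log_weight_b(f):
    """Local weight log(1 + f); naturally 0 when the term is absent."""
    return math.log10(1 + f)


# Hypothetical 1,000-word document in which a term is repeated f times.
total_words = 1000
for f in (1, 10, 100):
    print(
        f,
        keyword_density(f, total_words),
        round(log_weight_a(f), 2),
        round(log_weight_b(f), 2),
    )
```

Repeating the term 100 times instead of once multiplies keyword density by 100, but only raises 1 + log(f) from 1 to 3. That slow growth is the dampening effect described above.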

Better, by using these your results are closer to what is measured by an IR system. These can then be combined with collection-wide weights in the presence and absence of relevance information (IDF, IDFP, entropy, RSJ-PM, OKAPI, etc.).

3. “That would then give you an idea of the average keyword density across the ‘index’, which you could perhaps apply to your own site? Or else you could look at the median number of times a keyword appeared in these non-optimized, ‘natural’ pages and use that?”

It is hard to see a statistical justification for doing this at all.

Cheers,

Dr. E. Garcia
site: http://www.miislita.com [Features a lot of valuable research, for those who can handle the technical jargon.]
blog: http://irthoughts.wordpress.com
newsletter: http://www.miislita.com/irw/ir-watch.html

Gab:

Hi Dr. Garcia,

That helps broadly, but as with most of the content on your site, the jargon is way over my head so you lost me at point 2.

What I meant about KWD being valuable relative to the index is perhaps what you refer to as inverse document frequency.

In plain English, I understand IDF to mean that the less a word is used (up to a point), the more likely it helps describe that document’s content. And so, terms that are infrequently used across the collection are probably keywords (as opposed to stop words), and therefore worth paying attention to if you’re a search engine.

My point is that if you can figure out what range is “infrequent use” in a search engine’s collection, then you can edit your content to ‘infrequently use’ the keywords. Am I making sense?

Besides that, as a friendly tip, your content is evidently very erudite and researched, but it’s quite inaccessible to laymen. As you claim to want to help bust SEO myths and other silliness, you might be more effective (both directly in your communication and indirectly by gaining more links…) if you translated your material into simple English.

Best,
Gab

Dr. Garcia:

Thank you again for taking the time to email me.

1. KWD is not IDF. KWD is just the frequency or number of times a term is present in a document divided by the total number of words in that document.

This is why it is a local weight [i.e. as opposed to “global,” which in information retrieval refers to your whole index aka collection of documents]. It is also not a log.

2. IDF is the log of the total number of documents in a collection divided by the number of documents containing a particular term, regardless of how many times it appears in a document; i.e., IDF = log(D/di), where D is the total number of documents and di is the number of documents containing a given term, ‘i’ [counting each document only once, even if a keyword appears in it multiple times].

[Emphasis added.]
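
[Ed: To make the difference between points 1 and 2 concrete, here is a small Python sketch on a made-up four-document collection. The documents, the whitespace tokenization, the base-10 log, and returning None for a term that appears nowhere are all my assumptions.]

```python
import math


def keyword_density(term, document):
    """KWD: occurrences of the term divided by total words in ONE document."""
    words = document.lower().split()
    return words.count(term) / len(words)


def idf(term, documents):
    """IDF = log(D / di), counting each document once however often the term repeats."""
    D = len(documents)
    d_i = sum(1 for doc in documents if term in doc.lower().split())
    if d_i == 0:
        return None  # assumption: treat IDF as undefined for unseen terms
    return math.log10(D / d_i)


# Hypothetical toy collection, for illustration only.
docs = [
    "search engines rank pages",
    "search search search marketing tips",
    "marketing budgets for small firms",
    "cooking pasta at home",
]

print(keyword_density("search", docs[1]))  # local: depends on one document
print(idf("search", docs))                 # global: 2 of 4 documents contain it
print(idf("pasta", docs))                  # global: 1 of 4 documents contains it
```

Note that the three repetitions of "search" in the second document raise its keyword density but do not change the IDF, which only counts that document once.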

3. IDF and KWD do not incorporate relevance information.

So, for all of the above, KWD is not IDF. They also measure different things. IDF is a measure of the discriminatory power of a term in a collection.

IDF was useful with early search engines and IR systems. But because large search engines on the Web are too generic, it has slowly been phased out in favor of models that incorporate relevance information and are more stable as the size of the collection grows.

In the early IR literature, IDF was mistaken for a measure of term importance. Terms important to average users tend to have high search volume, regardless of their IDF values.

By contrast, very rare terms might have high IDF, but this does not necessarily make them important.

In fact, very rare terms are more likely to have low search volume simply because they are rare. Average users are not prone to query very rare terms at all.

If you want to find useful terms from past the top 100 or 200 results, I would suggest identifying those within that window containing high search volume instead. If you do that, you don’t need to worry about KWD or IDF.

With regard to my writing style, my target audience is graduate students and faculty colleagues, more than marketers.

Cheers,

Dr. E. Garcia
email: admin@miislita.com

Gab:

Hi Dr. Garcia,

Heehee, I like your tips and explanation on why it works. Matter of fact, I have used that technique before and continue to advocate it, as I did at SMX West as a Q&A mod on the KW research panel :).

Cheers
Gab

Dr. Garcia:

Hi, there:

Here is a simple procedure for mining the Web for useful terms:

1. Query a search engine for a specific topic. This query can be of x number of terms or a single term.
2. Collect top N ranked titles (or entire documents).
3. Remove stop terms.
4. Remove dupe terms.
5. Evaluate their search volumes with query tools.
6. Sort these in decreasing order of search volumes.

(Try also sorting combinations of terms; i.e., phrases.)
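
[Ed: The six steps above can be sketched in Python roughly as follows. Everything concrete here is hypothetical: the titles, the tiny stop-word list, and especially the search-volume numbers, which stand in for whatever a real keyword tool would report. Steps 1 and 2 (querying and collecting titles) are assumed to have been done already.]

```python
# Assumed, deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}


def candidate_terms(titles, stop_words=STOP_WORDS):
    """Steps 3 and 4: drop stop terms and duplicates, keeping first-seen order."""
    seen = set()
    terms = []
    for title in titles:
        for word in title.lower().split():
            word = word.strip(".,:;!?\"'()")
            if word and word not in stop_words and word not in seen:
                seen.add(word)
                terms.append(word)
    return terms


def rank_by_volume(terms, volume_lookup):
    """Steps 5 and 6: look up each term's search volume, sort in decreasing order."""
    return sorted(terms, key=lambda t: volume_lookup.get(t, 0), reverse=True)


# Hypothetical top-N titles for a query on one topic.
titles = [
    "The Guide to Hawaii Hotels",
    "Hawaii Flights and Hotels",
    "Cheap Flights to Maui",
]

# Made-up search volumes, NOT real data.
volumes = {"hawaii": 900, "flights": 800, "hotels": 700,
           "cheap": 500, "maui": 300, "guide": 100}

print(rank_by_volume(candidate_terms(titles), volumes))
```

Terms missing from the volume lookup are treated as having zero volume, which is a simplification of mine; in practice you would probably want to flag them instead.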

Note: The top N should belong to a topic in common (the so-called Cluster Hypothesis). [I didn’t follow up on what that is.]

However, at times, if the terms in your initial query are too ambiguous, you may find topic transitions as you move further down the results, past the initial top N.

[Translation: If you use ambiguous keywords to start with, the further you go into the search results, the greater the chances you’ll find keywords on a variety of topics.]

This also occurs when queries consisting of combinations of two semantically disconnected terms are used (e.g., try with “aloha Hawaii” versus “aloha any state”, where any state is any state of the Nation). I believe I covered this in an old Search Engine Watch Forum [post?] back in 2004-2006 or so.

Cheers,

Dr. E. Garcia

Gab:

Thank YOU for taking the time to respond.

Also, I appreciate you keeping things at a level I can understand, since I’m sure that is an extra effort, given your main target audience.

Actually I found that very helpful, and was wondering if you’d mind my sharing that as a post on my blog?

Cheers
Gab

Dr. Garcia:

Hi, there:

1. Feel free to publish.

2. Glad you have figured that out. I spoke on that at SES NY 2005 in a remote presentation on co-occurrence theory.

3. I no longer speak at those events, although they were a lot of fun. To be honest, I realized those conferences are full of speakers promoting a lot of nonsense and SEO myths/hearsays/own crappy ideas, etc., which is a real shame. Ever since, I have concentrated on academic research events.

Cheers,

Dr. E. Garcia

Dr. Garcia:

“Bit of a shame you no longer attend search shows just cuz some goofs DN what they’re saying. Doesn’t the free debate of ideas lead to the truth?

I’m thinking your presence would help minimize the spread of said crappy ideas, hehe!” [Citing my email]

My speaking engagement policies are clearly outlined on my site; i.e., I speak for free at not-for-profit events.

However, I don’t speak for free at for-profit events. The only three for-profit marketing conferences I attended outside Puerto Rico were because the organizers generously supported my presentations (travel & hotel). These were Jupitermedia’s SES (NY 2005 and San Jose, 2005) and OJOBuscador Congress, Madrid, Spain 2007.

I prefer to spend the limited time I have teaching graduate courses on IR and mentoring graduate students in their thesis work.

With these duties, writing my IR Newsletter, acting as a peer reviewer for scientific journals and serving as a program committee member of W3C’s AIRWeb Workshops, I simply don’t have the time for marketing conferences, nor do I like these when the presenting “experts” are not so.

As for debating ideas in the open, that’s why I have the IR Thoughts blog: to debunk SEO Snakeoil. The blog is also used to contact students taking my graduate courses.

BTW: I am teaching a new graduate course on Web Spam wherein SEO Snakeoil Myths and case studies of these will be dissected. The syllabus is announced at
http://irthoughts.wordpress.com/2009/04/02/airweb-course-announcement/

Talking about SEO Snakeoil, recently I have had to expose some flawed arguments from a Stompernet video regarding LSI.

They tried to debunk LSI, but the arguments used were completely flawed and did more harm than the myth they tried to debunk.

As I mentioned to the creator of the video, before trying to debunk something we at least need to know (understand) what exactly it is that we’re trying to debunk. We cannot just debunk hearsay with more hearsay as they did. Now, that is a shame. Two wrongs do not make a right.

Check here:

http://irthoughts.wordpress.com/2009/04/09/finally-seos-are-getting-the-lsi-myth/

Cheers,

Dr. E. Garcia

[My next email to Dr. Garcia just had some questions.]

Dr. Garcia: Questions 1 and 3 are related so I will answer both now. After that, I will answer question 2.

“1) Why is a log function used for calculating IDF?

“3) Would it be accurate to describe IDF as ‘the ratio of documents in a collection to documents in that collection with a given term’? I’m guessing your answer would be, IDF is the [LOG of “the ratio of documents in a collection to documents in that collection with a given term”]? Which brings us back to question 1, I guess? hehe”

These are recurrent questions students have asked me before. The reason for using logs is due to two assumptions frequently made in most IR models;

i.e.

I. Scoring functions are additive.
II. Terms are independent.

While in some models II might not be present, both (I and II) play well with logs since these also are additive.

These functions and ‘why the use of logs?’ are explained in the recent RSJ-PM Tutorial:

http://www.miislita.com/information-retrieval-tutorial/information-retrieval-probabilistic-model-tutorial.pdf

Document Frequency (DF) is defined as d/D, where d is the number of documents containing a given term and D is the size of the collection of documents. If we take logs we obtain log(d/D).

But since often D > d, log(d/D) gives a negative value. To get rid of the negative sign, we simply invert the ratio inside the log expression.

Essentially we are compressing the scale of values so that very large and very small quantities can be smoothly compared.

Now log(D/d) is conveniently called Inverse Document Frequency.
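
[Ed: A quick Python check of the sign flip and the scale compression just described. The collection size and document counts are made up, and the base-10 log is my assumption.]

```python
import math


def document_frequency(d, D):
    """DF = d / D: fraction of the collection containing the term."""
    return d / D


def inverse_document_frequency(d, D):
    """IDF = log(D / d): the same log with the ratio inverted."""
    return math.log10(D / d)


D = 1_000_000  # hypothetical collection size
for d in (1, 1_000, 500_000):
    df = document_frequency(d, D)
    # log(d/D) is negative; log(D/d) has the same size with a positive sign.
    print(d, df, round(math.log10(df), 3), round(inverse_document_frequency(d, D), 3))
```

Document counts that span six orders of magnitude end up on an IDF scale running only from about 0.3 to 6.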

Now going back to d/D, this is a probability estimate p that a given event has occurred. Let the presence of a term in a document be that event. If terms are independent, it must follow that for any two events, A and B

p(AB) = p(A)p(B).

Taking logs we can write

log[p(AB)] = log[p(A)]+ log[p(B)]

It is easy to show that for two terms

log(d12/D) = log(d1/D) + log(d2/D)

[Ed: This isn’t 12 in the sense of 6×2. See the explanation just below, marked as an update.]

Inverting and using the definition of IDF we end up with

IDF12 = IDF1 + IDF2

validating assumption I; that IDF as a scoring function is additive.

That is, the IDF of a two-term query is the sum of individual IDF values. However, this is only valid if terms are independent of one another. If terms are not independent we would have two possibilities; i.e.,

p(AB) > p(A)p(B)

or

p(AB) < p(A)p(B)

and we cannot say that the IDF of a two-term query (e.g., a phrase) is the sum of individual IDF values. Assuming the contrary, as many SEOs do, is plain snake oil.

Update/Clarification: I asked for a clarification on all this, and Dr. Garcia again generously shared his knowledge:

“The “1”, “2”, and “12” above are mere subscripts, not values.

So IDF1 stands for the IDF of term 1, IDF2 for the IDF of term 2, and IDF12 for the IDF of terms 1 and 2. An example follows.

A collection consists of 100 docs (D = 100)

20 docs mention t1=search, so IDF1 = log(100/20) or about 0.7
40 docs mention t2=marketing, so IDF2 = log(100/40) or about 0.4

What would be the IDF of a combination of terms like search marketing?

If we assume that terms are independent, IDF12 should be the sum; i.e., IDF12 = IDF1 + IDF2 = 0.7 + 0.4, or about 1.1.

This sum corresponds to 8 of the 100 documents mentioning both terms, since 100 × (20/100) × (40/100) = 8 and log(100/8) is about 1.1.

If we find out that more than 8 of the 100 docs mention both terms, the terms are not independent of one another, but actually positively correlated.

For example, if 15 docs mention both terms, IDF12 will not be 1.1 but actually log(100/15), or about 0.8.

The reverse scenario, in which fewer than 8 of the 100 docs actually mention both terms, is an example of negatively correlated terms, which are still not independent.

Consequently, the exercise made by SEO keyword researchers of adding individual terms’ IDFs to come up with some sort of IDF for phrases is flawed by all accounts.
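
[Ed: Here is the worked example above as a short Python check, using the same numbers (D = 100, 20 docs for “search”, 40 for “marketing”, 15 observed with both). The base-10 log matches the values of 0.7 and 0.4 quoted above; the variable names are mine.]

```python
import math


def idf(d, D):
    """IDF = log10(D / d) for a term (or term pair) found in d of D documents."""
    return math.log10(D / d)


D = 100   # collection size
d1 = 20   # docs mentioning "search"
d2 = 40   # docs mentioning "marketing"

# Under independence, the expected number of docs containing both terms
# is D * (d1/D) * (d2/D) = d1 * d2 / D.
expected_both = d1 * d2 / D

idf_sum = idf(d1, D) + idf(d2, D)           # IDF1 + IDF2
idf_if_independent = idf(expected_both, D)  # IDF12 when terms are independent
idf_observed = idf(15, D)                   # IDF12 when 15 docs contain both

print(expected_both)
print(round(idf_sum, 2), round(idf_if_independent, 2))
print(round(idf_observed, 2))
```

The sum IDF1 + IDF2 only matches IDF12 when the number of docs containing both terms is exactly the 8 predicted by independence. With 15 such docs, the sum overstates the phrase's IDF.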

[Ed: It just occurred to me that at any rate, you’d need to know the total size of the index in order to calculate IDF. AFAIK, that # isn’t public.]

“2) What do you mean by ‘discriminatory power’ in the phrase ‘IDF is a measure of the discriminatory power of a term in a collection’?”

This is a legacy idea from Robertson and Sparck Jones. The discriminatory power of a term (aka term specificity) implies that terms too frequently used are not good discriminators between documents.

If a term is used in too many documents its use to discriminate between documents is poor. By contrast, rare terms are assumed to be good discriminators since they appear in few documents.

One particular research project I am working on contradicts the whole notion of defining the discriminatory power of a term using IDF. I cannot comment too much on it; however, I presented preliminary results last month at a local scientific conference. The abstract is at

http://irthoughts.wordpress.com/2009/03/05/sidim-xxiv-conference/

Cheers,

Dr. E. Garcia

Gab: p.s. Read also Dr. Garcia’s post recapping the SIDIM conference and sharing an explanation of IDF and “vector space models.”

p.p.s. If you liked this post, you’ll likely enjoy some of the other interviews in my related posts’ section, especially this one with Ruud Hein: “How I Stay Ahead of the SEO Pack”. It’s probably also a good idea to add my RSS feed to your reader!

Author: sroiadmin