I was catching up on the Information Architecture Institute mailing list and some feed reader backlog when I came across the concept of shingling. Seeing as I had never heard of this term in the 8+ years I've been working, I decided it was high time I learned about it.
In essence, shingling seeks to solve the problem of indexing large quantities of data:
Comparing every piece of content against every other piece requires a number of comparisons that grows roughly with the square of how much content you hold, and that quickly becomes impractical. Enter shingling.
Shingling is nothing more than taking a single document, splitting it into small overlapping chunks of consecutive words (the shingles), and generating a compact fingerprint from those shingles. This way, when you are indexing content, you can immediately break the document down into its constituent shingles, generate a fingerprint, and then check whether that fingerprint already exists in the index. What you do at that point is up to you.
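A minimal sketch of the idea in Python is below. The shingle size, the hash function, and the number of retained hashes are arbitrary choices of mine for illustration, not anything prescribed by shingling itself:

```python
import hashlib

def shingles(text, size=4):
    """Split text into overlapping shingles of `size` consecutive words."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))]

def fingerprint(text, size=4):
    """Hash every shingle and keep the smallest few hashes as a compact fingerprint."""
    hashes = sorted(
        int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)
        for s in shingles(text, size)
    )
    return tuple(hashes[:8])  # keep the 8 smallest hashes as a crude sketch

# Near-duplicate documents share most of their fingerprint;
# unrelated documents almost never do.
a = fingerprint("the quick brown fox jumps over the lazy dog near the river bank")
b = fingerprint("the quick brown fox jumps over the lazy dog near the river")
print(len(set(a) & set(b)))
```

Because the fingerprint is tiny compared to the document, comparing two fingerprints is cheap, and you can look a new fingerprint up in your index instead of re-reading everything you have already stored.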
I ran into a similar problem when implementing an n-gram search engine that ultimately proved inefficient. The computation required to generate the n-grams for a given page of content, check the index for each n-gram, and associate the indexed word or content chunk with that n-gram quickly became terribly expensive. And this n-gram search engine only had to index names, addresses, and email addresses; it wasn't even attempting to index user-generated content.
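Something like the naive sketch below illustrates where the cost piles up. The character trigrams and the record fields are my own stand-ins rather than the actual implementation:

```python
from collections import defaultdict

def ngrams(text, n=3):
    """All overlapping character n-grams of the text."""
    text = text.lower()
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]

class NGramIndex:
    """Naive inverted index: every n-gram points at every record containing it."""

    def __init__(self, n=3):
        self.n = n
        self.index = defaultdict(set)

    def add(self, record_id, text):
        # Every field of every record pays the cost of generating and
        # storing all of its n-grams, which is where the work adds up.
        for gram in ngrams(text, self.n):
            self.index[gram].add(record_id)

    def search(self, query):
        grams = ngrams(query, self.n)
        candidates = [self.index[g] for g in grams if g in self.index]
        return set.intersection(*candidates) if candidates else set()

idx = NGramIndex()
idx.add(1, "jane.doe@example.com")
idx.add(2, "john.roe@example.org")
print(idx.search("doe@example"))  # {1}
```

Every record multiplies the number of n-grams to generate, look up, and associate, so the indexing cost grows much faster than the amount of content being stored.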
I have to wonder how full-text indexing providers, like SQL Server and Ferret, handle (or will handle) the challenge of shingling indexed content. It seems like it would be a concern both for consolidating storage and for conserving CPU cycles during retrieval, indexing, and comparison.