From Data to Information

I originally became interested in programming primarily as a means of information visualization. (Truthfully, back in 2000/2001 I wanted my own blog, but I got distracted and eventually learned how to do other things.) Once I discovered some of the ideas in the field of information architecture and, later, the IxDA, I was hooked. One of the principal ideas that stuck with me is that information must be structured and searchable in order to be useful to a consumer. It seems like a really fundamental, simple idea. Unfortunately, at that time (and now, arguably) very few people really understood it, or even thought about it.

The more I thought about it, the more I realized that data, that ephemeral substance we all capture in databases, is fundamentally without meaning and value. Data in its most basic form is merely a point or set of points on a continuum, e.g. 4, 15, 6. But when you provide context for the members of your data set, you suddenly gain information: lemonade sales were 4 times higher when the temperature was 15 degrees above average in June.

There’s a distinct information lifecycle that most people are aware of, but here it is anyway if you’ve forgotten it:

Data is given meaning and becomes
Information, which is processed to become
Knowledge, which leads to
Understanding.

Now, the knowledge and understanding part deals with the messy process of thinking and ideation. While thinking and ideation are fascinating topics, they involve individual biases, psychology, neurology, and other subjects I don't find as interesting, so I'm not going to deal with them here. We'll pretend they don't exist for the purposes of this discussion.

To get back to the main point, one of the most fascinating things to me is the process of giving meaning to data in order to turn it into information. What's also interesting is that there is seldom a single right answer, as so many of us have discovered. The right answer depends on the question asked, and the important thing we do, as keepers of data, is enable people to ask those questions and turn meaningless piles of facts and figures into information for the purpose of making business decisions (which hopefully doesn't turn that 4, 15, 6 into "In the 4th quarter, revenue was down 15%, so we're laying off 6 of you").

Equally important is metadata. The more data you collect to describe another data point, the more meaning you can give to that data point. Let's look at a similar example: on August 16th, our little lemonade stand sold twice as much lemonade as it did on August 23rd. Without any additional data describing those two days, we can't do more than report raw sales figures. But what happens if we track temperature and other weather conditions? Maybe August 16th was particularly hot. Maybe it rained all day on August 23rd. Add in information about local events, and suddenly we have another descriptor for these two days: we learn that a town parade passed right by our stand on August 16th. This is the type of metadata that turns meaningless data points into valuable pieces of information. By leveraging technology, we can easily associate these data points with their descriptors, building meaningful information surrounded by descriptive metadata that enables rapid decision making and makes it easier to search and browse related topics and ideas.
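To make that join concrete, here's a minimal sketch in Python. The dates, sales figures, and weather readings are all hypothetical, invented for illustration; the point is only that keeping metadata keyed to the same identifier as the raw data lets you enrich a bare number on demand:

```python
# Hypothetical lemonade-stand sales (cups sold per day) -- raw, meaningless data
sales = {"2008-08-16": 48, "2008-08-23": 24}

# Metadata collected separately, keyed by the same date identifier
weather = {
    "2008-08-16": {"high_f": 94, "rain": False},
    "2008-08-23": {"high_f": 71, "rain": True},
}
events = {"2008-08-16": ["town parade"]}

def describe(date):
    """Join a raw sales figure with its descriptive metadata."""
    return {
        "date": date,
        "cups_sold": sales[date],
        "weather": weather.get(date, {}),
        "events": events.get(date, []),
    }

report = describe("2008-08-16")
# The bare "48" now carries the context (heat, parade) that explains it
```

The same pattern scales to any metadata source you can key by the same identifier.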

Where does that leave us? Well, if the point of information is to be processed into knowledge that enables understanding, then it's fairly clear: information retrieval systems need to provide as much context as possible for the underlying data points, and information storage systems need to be designed to facilitate data collection and storage. Specifically, storage systems must accommodate diverse types of metadata: documents, images, raw text, audio, and video files all need to be stored to enable the transformation of raw data points into information.

At what point do we stop collecting data ourselves and start aggregating data from disparate sources? There comes a point when we simply can't store enough data fast enough from all potential collection sources, and we need to rely on others to help us turn our data into information. In turn, some of our data will probably help our data providers turn their data into information. Slowly but surely, our growing need to turn data into information via metadata will give us access to an increasingly complex and interconnected world.

Shingling – it’s not just for roofers!

I was catching up on the Information Architecture Institute mailing list and some feed reader backlog when I came across the concept of shingling. Seeing as how I have never heard of this term in the 8+ years I’ve been working, I decided it was high time that I learned about it.

In essence, shingling seeks to solve the problem of indexing large quantities of data:

  • How can you tell if two pieces of content are the same? You compare them.
  • What happens when you want to compare a lot of pages? You have to make a very large number of comparisons.
  • What happens when you're Google? You can't solve this problem with bulk comparisons.

The number of computations required to compare all of your content grows quadratically: comparing n documents pairwise takes on the order of n² comparisons, which rapidly becomes intractable. Enter shingling.

Shingling is nothing more than taking a document, splitting it into smaller overlapping chunks (the shingles), and generating a compact, sufficiently unique fingerprint from those shingles. When indexing content, you immediately break the document into its constituent shingles, generate a fingerprint, and check whether that fingerprint already exists in the index. What you do at that point is up to you.

I ran into a similar problem when implementing an n-gram search engine that ultimately proved to be inefficient. The computation required to generate the n-grams for a given page of content, check the index for each existing n-gram, and associate the indexed word or content chunk with each n-gram quickly proved terribly expensive. And this engine only had to index names, addresses, and email addresses; it wasn't even attempting to index user-generated content.
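For illustration, here's a toy version of that kind of character n-gram index in Python. The class and method names are my own invention, not the engine described above, and a production index would need ranking and far better storage:

```python
from collections import defaultdict

def ngrams(text, n=3):
    """Generate overlapping character n-grams, e.g. 'smith' -> smi, mit, ith."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

class NGramIndex:
    def __init__(self, n=3):
        self.n = n
        self.index = defaultdict(set)  # n-gram -> set of record ids

    def add(self, record_id, text):
        # Every record insert touches one index bucket per n-gram,
        # which is exactly where the cost piles up at scale
        for gram in ngrams(text, self.n):
            self.index[gram].add(record_id)

    def search(self, query):
        """Return ids of records sharing at least one n-gram with the query."""
        candidates = set()
        for gram in ngrams(query, self.n):
            candidates |= self.index.get(gram, set())
        return candidates

idx = NGramIndex()
idx.add(1, "jane.smith@example.com")
idx.add(2, "john.smyth@example.net")
idx.add(3, "alice@mailhost.org")

matches = idx.search("smith")  # only record 1 contains 'smi', 'mit', or 'ith'
```

Even in this toy, every record multiplies the index work by the number of n-grams it contains, which hints at why the approach fell over for me.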

I have to wonder how full-text indexing providers, like SQL Server and Ferret, handle (or will handle) the challenge of shingling indexed content. It seems like it would be a concern both for storage consolidation and for optimizing the CPU cycles spent on retrieval, indexing, and comparison.

  • Near-duplicates and shingling
  • Navigating the network of knowledge: Mining quotations from massive-scale digital libraries of books
  • N-gram, from Wikipedia, the free encyclopedia
