Category: Information Architecture

Links for the Week 2009.05.22

Big pile o’ link love this week. Honestly, I didn’t include a ton of GREAT links from Brent Ozar because people would start to think that Brent Ozar pays me to link to his site and say Brent Ozar a lot. He doesn’t, but if you click on the links to Brent Ozar maybe he’ll see where the traffic came from and pay me to provide links to Brent Ozar.

SQL Server

SQL Server 2008 Developer Training Kit Available for Download Denis Gobo provides a link and a quick summary of Microsoft’s most recent training offering for developers that will help get people up to speed with SQL Server 2008.

PASS Virtualization Virtual Chapter That’s right, we have a new thing at PASS. Well, the same old thing has a new name. SIGs are now Virtual Chapters. And Brent Ozar is now in charge of the PASS Virtualization Chapter. Check it out!

Download – SQL Server 2008 Developer Training Kit Free training. Free training. Free training. Free training.

Excel Functions for SQL Server Sometimes I’ll find myself using SQL Server and longing for something from Excel that one of my more management-type friends has shown me. Now I can, in theory, have some of that Excel love right in SQL Server.

What’s a ‘DBA’? I’ve known for a long time that, while I love data, I’m not a DBA… not 100%, at least. Sam Bendayan answers the question and talks about what job title options there are for database professionals.

Development

MvcFluentHtml – Fluent HTML Interface For MS MVC ASP.NET MVC doesn’t use a bad method to generate HTML, but there are definitely smoother ways, depending on your preferences. Fluent HTML offers one such paradigm to make it a bit easier to generate HTML in your views. It’s closer to how Ruby on Rails does things, and I like Rails. A lot. Almost as much as I like SQL Server.

Dear Art Director…

Stuff & Things

The Information Architecture of Personal Music Collections Dan Brown, the famed Information Architect (not the famed author), spent a lot of time thinking about how people interact with music libraries. The poster is from 2005 but, shockingly, not a lot has changed since then.

11 Striking Findings From an Eye Tracking Study Eye tracking is some great stuff; it’s right up there with click tracking. It helps us, as bloggers, figure out what you, the readers, are paying attention to.

How to Maintain a Healthy Lifestyle When You’re Too Busy To Care Title says it all. Lazy? Want to get in better shape? Do this.

Evil Lair: On the Architecture of the Enemy in Videogame Worlds I don’t know what to say about this, really. This is a fascinating article about how architecture currently works its way into video games and how it could be used.

10 for $10 hardcore summer tour This is the coolest idea for a summer tour – 10 bands for $10. If you’re at all into hardcore punk, it’ll be a great show. If you aren’t (which is more than likely since you’re reading this blog), take note because it’s an interesting idea that you might see more of in the future.

How to Build Your Own PC WARNING: NOT SAFE FOR WORK SomethingAwful.com is often flagged as adult content. Don’t visit it if you like keeping your job. That being said, this is a hilarious look at building your own computer. It’s based on Brent Ozar‘s experiences building his hackintosh.

From Data to Information

Originally, a long time ago, I became interested in programming primarily as a means of information visualization. (Well, truthfully, back in 2000/2001 I wanted to have my own blog but I got distracted and eventually learned how to do other things.) Anyway, once I discovered some of the ideas in the field of information architecture and, later, the IxDA, I was hooked. One of the principal ideas that stuck with me is that information must have a structure and must be searchable in order to be useful to a consumer. This seems like a really fundamental, simple idea. Unfortunately, at that time (and now, arguably) very few people really understood this idea, or even thought about it.

The more I thought about it, the more I realized that data, that ephemeral substance that we all capture in databases, is fundamentally without meaning or value. Data in its most basic form is merely a point or set of points on a continuum – e.g. 4, 15, 6. But when you provide additional context for the members of your data set, you suddenly gain information: lemonade sales were 4 times higher when the temperature was 15 degrees higher than average in June.

There’s a distinct information lifecycle that most people are aware of, but here it is anyway if you’ve forgotten it:

Data is given meaning and becomes
Information which is processed to become
Knowledge which leads to
Understanding

Now, the knowledge and understanding part deals with the messy process of thinking and ideation. While thinking and ideation are fascinating topics, they involve individual biases and things like psychology and neurology that I don’t find as interesting, so I’m not going to deal with them here. For the purposes of this discussion, we’ll pretend they don’t exist.

Essentially, to get back to the main point, one of the most fascinating things to me is the process used to give meaning to data in order to turn it into information. What’s also interesting is that there is seldom a single right answer, as so many of us have discovered. The right answer depends on the question asked and the important thing that we do, as keepers of data, is enable people to ask those questions and turn facts and figures and meaningless piles of data into information for the purpose of making business decisions (which hopefully doesn’t turn that 4, 15, 6 into ‘In the 4th quarter, revenue was down 15% so we’re laying off 6 of you’).

Equally important is metadata. The more data you can collect to describe another data point, the more meaning you can give to that data point. Let’s look at a similar example: on August 16th, our little lemonade stand sold twice as much lemonade as it sold on August 23rd. Without any additional data to describe August 16th and August 23rd, we aren’t able to do any more than report raw sales figures. But what happens if we track temperature and other weather conditions? Maybe August 16th was particularly hot. Maybe it rained all day on August 23rd. Add in more information about local events and suddenly we have another descriptor for these two days: we know that there was a town parade on August 16th that passed right by our lemonade stand. This is the type of metadata that turns meaningless data points into valuable pieces of information. By leveraging technology, it’s possible to associate these data points with their descriptors and build meaningful pieces of information surrounded by descriptive metadata – metadata that enables rapid decision making and facilitates easier search and browsing of related topics and ideas.
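As a toy sketch (the dates and figures below are invented to mirror the lemonade example), a couple of metadata lookups are all it takes to turn raw counts into something a person can actually reason about:

```python
# Raw data: cups of lemonade sold per day (made-up figures).
sales = {"2009-08-16": 48, "2009-08-23": 24}

# Metadata describing those same days.
weather = {"2009-08-16": "95F and sunny", "2009-08-23": "70F, rain all day"}
events = {"2009-08-16": "town parade passed the stand", "2009-08-23": None}

for day, cups in sorted(sales.items()):
    # Join each sales figure with whatever descriptors we have for that day.
    context = ", ".join(c for c in (weather.get(day), events.get(day)) if c)
    print(f"{day}: {cups} cups ({context or 'no additional metadata'})")
# 2009-08-16: 48 cups (95F and sunny, town parade passed the stand)
# 2009-08-23: 24 cups (70F, rain all day)
```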

Where does that leave us? Well, if the point of information is to be processed into knowledge that enables understanding, then it’s fairly clear: information retrieval systems need to provide as much context as possible for the underlying data points, and information storage systems need to be designed in a way that facilitates data collection. Specifically, storage systems need to allow for the storage of diverse types of metadata – documents, images, raw text, audio, and video files all need to be stored to enable the transformation of raw data points into information.

At what point do we stop collecting data and start aggregating data from disparate sources? There comes a point when we simply can’t store enough data fast enough from all potential collection sources. At this point, we need to rely on others to help us turn our data into information. In turn, there is probably some of our data that will help our data providers turn their data into information. Slowly but surely, our growing need to turn data into information via included metadata will enable us to access an increasingly complex and interconnected world.

Shingling – it’s not just for roofers!

I was catching up on the Information Architecture Institute mailing list and some feed reader backlog when I came across the concept of shingling. Seeing as I had never heard of this term in the 8+ years I’ve been working, I decided it was high time I learned about it.

In essence, shingling seeks to solve the problem of indexing large quantities of data:

  • How can you tell if two pieces of content are the same? You compare them.
  • What happens when you want to compare a lot of pages? You have to make a very large number of comparisons.
  • What happens when you’re Google? You can’t solve this problem with bulk comparisons.

The number of computations required to compare all of your content rapidly trends toward infinity. Enter shingling.
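To put rough numbers behind “trends toward infinity”: checking every document against every other document takes n(n-1)/2 comparisons. A quick back-of-the-envelope sketch (the document counts below are made up for illustration):

```python
# Pairwise comparisons needed to check n documents against each other.
def pairwise_comparisons(n):
    return n * (n - 1) // 2

print(pairwise_comparisons(1_000))          # 499500
print(pairwise_comparisons(1_000_000))      # ~5 * 10^11
print(pairwise_comparisons(1_000_000_000))  # ~5 * 10^17 – not happening
```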

Shingling is nothing more than taking a single document, splitting it into smaller chunks, and generating a sufficiently sized unique fingerprint from the shingles. This way, when you are indexing content, you can immediately break down the document into the constituent shingles, generate a fingerprint, and then see if that fingerprint already exists. What you do at this point is up to you.
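As a minimal sketch (my own illustration, not any particular engine’s implementation), assuming word-level shingles and Python’s built-in hash() as the fingerprint function:

```python
# A minimal shingling sketch: split a document into overlapping w-word
# shingles, hash each shingle, and treat the set of hashes as the fingerprint.
# hash() is fine for a demo; a real index would use a stable hash (hashlib).

def shingles(text, w=4):
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}

def fingerprint(text, w=4):
    return {hash(s) for s in shingles(text, w)}

def resemblance(a, b):
    # Jaccard similarity of two fingerprints: |A intersect B| / |A union B|
    return len(a & b) / len(a | b)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
print(resemblance(fingerprint(doc1), fingerprint(doc2)))  # 0.2 with w=4
```

Identical documents score 1.0, so a near-duplicate jumps out without a full text comparison; in practice, systems typically keep only a sample of the shingle hashes (min-hashing) rather than the whole set.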

I ran into a similar problem when implementing an n-gram search engine. The computation required to generate the n-grams for a given page of content, check the index for existing n-grams, and associate the indexed word/content chunk with each n-gram soon proved terribly inefficient. And this n-gram search engine only had to index names, addresses, and email addresses – it wasn’t even attempting to provide an index of user generated content.
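For a sense of where the cost comes from, here’s a hypothetical sketch of that kind of character n-gram index (my own toy example, not the original implementation): every record fans out into many n-grams, and each n-gram needs its own index lookup and update.

```python
from collections import defaultdict

# Toy character n-gram index over short fields like names and email addresses.
def char_ngrams(text, n=3):
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

index = defaultdict(set)  # n-gram -> ids of records containing it

def add_record(record_id, value, n=3):
    # Every record generates many n-grams; each one is a separate index write.
    for gram in char_ngrams(value, n):
        index[gram].add(record_id)

def search(query, n=3):
    # Candidates are records sharing at least one n-gram with the query.
    grams = char_ngrams(query, n)
    return set().union(*(index[g] for g in grams)) if grams else set()

add_record(1, "brent ozar")
add_record(2, "denis gobo")
print(search("ozar"))  # {1}
```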

I have to wonder how full-text indexing providers, like SQL Server and Ferret, handle (or will handle) the challenge of shingling indexed content. It seems like it would be a concern both for storage consolidation and for optimizing CPU cycles – for retrieval, indexing, and comparison.

References
Near-duplicates and shingling
Navigating the network of knowledge: Mining quotations from massive-scale digital libraries of books
N-gram, From Wikipedia, the free encyclopedia
