Content Scraper: Dirty SOB Or Helpful Little Bee?

March 26, 2007

For bloggers, as for any other type of writer, it is certainly nice to get your words out where the most people can read them. Syndicating your blog feeds is one way to do this, and it works nicely. There is a downside to using RSS, however.

Content scrapers are a very real part of the blogosphere. How you deal with them affects your personal blogging environment and your control over the work you produce. It may have an important impact on your web traffic as well.

What should you, as a blogger, do about managing feeds and the existence of content scrapers? Your choice should reflect your goals and strategy as a blogger. I thought I would share with you what my approach is and how I came to the decision I did…

What Is A Content Scraper?

Content scrapers are both web applications and individuals. As individuals, content scrapers manipulate your blog feeds to generate Web traffic they can take advantage of. They build websites or blogs around a feed aggregation, drawing on the work of many hardworking bloggers and displaying those feed results on their own site.

As applications, content scrapers are basically active feed aggregation programs. Depending on the complexity of the programming or scripting involved, they seek out RSS feeds, or allow feeds to be added manually to a list, and insert the output of those feeds into web pages or blogs. Often the desire is to create the appearance of a portal site that is an authoritative source for some topic.

In other words, they post your articles, taken from your RSS feeds, on their websites without asking permission. Some of them credit you and link back to your site, which helps your blog rankings and may earn you new readership. Unscrupulous content scrapers, however, will try to pass the content off as their own by reformatting the output of the RSS feed before it gets piped into their site.
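To make the mechanics concrete, here is a minimal sketch of the reading half of a feed aggregator, using only Python's standard library. The feed text, the example.com URLs, and the `read_items` helper are made up for illustration; real scraper software will differ. The point is simply that whatever you publish in the feed's description is handed over, ready to be reposted.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed with a single item; the URLs and text are made up.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>First Post</title>
      <link>http://example.com/first-post</link>
      <description>Everything published in the feed ends up here.</description>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return (title, link, description) for every item in an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        ))
    return items


items = read_items(SAMPLE_FEED)
```

An aggregator that credits its sources would display the link alongside the text; one that does not simply discards it.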

Content Scrapers: Good Or Bad?

Well, if content scrapers are profiting from your work, it means that your writing gets exposure on the one hand. That might be a good thing, or maybe at least a neutral thing.

On the other hand, it means you lose control of your content. Comments and discussion spring up around your work that you are not aware of and have no real ability to participate in. Unless, of course, you want to spend all of your spare time crawling through log files, looking for the signs of such sites, going there, and joining the discussion.

That simply is out of the realm of consideration unless you intend to do nothing else but that. It is time-consuming enough just maintaining a blog or three and any sort of lifestyle with a pervasive Internet component these days. And it means giving up control of your life and your work.
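For a sense of what that log crawling involves, here is a rough sketch that counts the outside sites sending visitors to a blog. It assumes you have access to a web server log in the widely used "combined" format, which a free hosted blog may not offer; the log lines, host names, and the `referrer_hosts` helper are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# In the "combined" log format, the referrer and user agent are the last two
# quoted fields on each line.
LINE_RE = re.compile(r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"\s*$')


def referrer_hosts(log_lines, own_host):
    """Count referring hosts other than our own blog."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # line is not in the expected format
        host = urlparse(match.group("referrer")).netloc.lower()
        if host and host != own_host:
            counts[host] += 1
    return counts


# Hypothetical log lines; the hosts and addresses are made up.
sample = [
    '203.0.113.5 - - [26/Mar/2007:10:00:00 +0000] "GET /post HTTP/1.1" 200 512 "http://scraper.example/page" "Mozilla/5.0"',
    '203.0.113.6 - - [26/Mar/2007:10:01:00 +0000] "GET /post HTTP/1.1" 200 512 "http://myblog.example/" "Mozilla/5.0"',
    '203.0.113.7 - - [26/Mar/2007:10:02:00 +0000] "GET /feed HTTP/1.1" 200 512 "-" "FeedReader/1.0"',
]
hosts = referrer_hosts(sample, "myblog.example")
```

Even with a script like this, every unfamiliar host still has to be visited and judged by hand, which is where the time goes.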

Worst of all, you are helping someone else make money off your hard work and not getting paid for it. And they are often violating copyright laws with regard to your work, as well as the terms and conditions of various ad services.

You should ask yourself what your goals are and if you are taking sufficient steps to safeguard your ability to reach them. Are you turning out truly unique content that others might consider valuable, or are you just tossing two or three sentences out at a time about something that popped into your head? Depending on your perception of the value of your efforts, you may want to take specific steps.

What Are Your Options As A Blogger?

I have read and heard a lot of interesting suggestions, ranging from not publishing RSS feeds at all to putting your name, URL, and a link in the first portion of your post. There are some good ideas and discussions about dealing with scrapers out there; just run a query in any search engine and you will get plenty of results.

My Personal Choice For Dealing With Content Scrapers?

I decided a short while back to go with partial feeds. Let me explain why and what it has meant to me in terms of traffic.

Most of the methods for dealing with scraping take work I would rather not do. I do not feel like contacting search engines and trying to shut down AdSense accounts belonging to scrapers. I want an easy solution.

Let me outline my personal and simple strategy and why it works for me…

What I Want To Accomplish

  • I want to maintain control of my content.
  • I want to increase readership.
  • I want to stick it to unscrupulous content leeches.
  • I want to foster increased discussion.

What I Do Not Want Happening

  • I do not want to decrease readership.
  • I do not want to incur search engine penalties for duplicate content.
  • I do not want someone else getting a free ride off my hard work.
  • I do not want to ‘lock out’ those who are not online 24/7.

What solution meets those goals? Using partial feeds.

My Case For Partial Feeds?

It meets all of my stated goals. Plain and simple.

Meeting My Strategic Goals

Granted, the partial feeds solution is weak in the ‘stick it to unscrupulous content leeches’ department, but it at least forces them to link back to my blog for the full article. They cannot use the one- to three-paragraph excerpt that I publish in my feeds without a link, unless they want people to know they are obviously engaging in theft, or to think they are extremely weird for teasing a piece that does not appear on their site.
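Blogging platforms typically offer partial feeds as a simple setting, so no code is needed to use them. Purely as an illustration of the idea, here is a small sketch that trims a post to its first couple of paragraphs and adds a pointer back to the full article. The `make_excerpt` function, the two-paragraph default, and the example.com URL are assumptions made for the example, not how any particular platform does it.

```python
def make_excerpt(post_text, post_url, max_paragraphs=2):
    """Return the first few paragraphs of a post plus a link to the full article."""
    # Treat blank lines as paragraph breaks and ignore empty chunks.
    paragraphs = [p.strip() for p in post_text.split("\n\n") if p.strip()]
    excerpt = paragraphs[:max_paragraphs]
    if len(paragraphs) > max_paragraphs:
        # Only point readers at the blog when something was actually left out.
        excerpt.append("Read the full article: " + post_url)
    return "\n\n".join(excerpt)


post = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
feed_body = make_excerpt(post, "http://example.com/my-post")
```

Anything a scraper republishes from such a feed is only the teaser, and the rest of the piece lives on the original blog.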

I did not decrease readership. In fact, it seems I have increased readership slightly by letting interested readers stay updated on my current headlines while leaving them curious enough to read the full article.

If someone sees an article that sounds interesting enough, they will come read it on my blog. You are here reading this, after all, are you not? It also serves to keep me on my toes and ensure I put effort into making sure the excerpts sound intriguing and have catchy titles.

People can subscribe to a comments RSS feed, and if they see something worth commenting on, they will jump in. I also manage to maintain control of my content at a level that means discussion is happening on my site and not elsewhere.

The Results?

I have not found my feed readership to decrease since going to partial feeds. I have found my traffic jumped a bit, perhaps due to more people coming to the blog to read the entire piece.

I did notice one thing of great importance that tells me what I was looking for is in fact happening. On several of the days when I had the most visits to my blog, I had low feed readership.

Also, I took the number of blog visitors on the day with the lowest feed readership over the past 30 days as a baseline, and noticed that I actually had 12 days with fewer blog visitors than that. Incidentally, that day with the lowest number of feed readers came right after a couple of days I took off. There were 18 days in the month with more blog visitors than that baseline day.

If that is hard to make sense of, let me simplify it: partial feeds did not decrease my site traffic. In fact, they increased it slightly by getting more people to come read on the blog.

The day with the highest blog visits happened to be the day of the month with the lowest number of feed readers. Ask me which I would rather have, more feed readers or more blog visitors, and I will go for blog visitors every time.

Why? Because I am confident they will come back again if they read an entire blog posting. My feed subscriptions don’t seem to be dropping off either. And commenting is picking up.

All in all, no detrimental effects from switching to partial feeds.

Important Considerations

There are a few considerations to point out here. This is a free, hosted blog. I am not currently trying to make money from it, so I am not worried about whether my blog grows a bit more slowly than that of someone who might be out to get thousands of daily readers. That has two benefits for me.

One, it allows me to embrace partial feeds and forget about the majority of content scrapers.

Two, IF by some injustice the search engines shut me down for duplicate content because of some scraper’s actions, I have the ability to move to a domain of my own.

The Importance Of Prior Planning

The eventual moving of this blog to its own domain has been an important consideration since the start of this blog. The intent all along has been to see how I could grow it, how successful I could be at it, without spending any money on it.

It was thus important to have a backup plan if things went well—and one in case they did not.

My intention has been to move the blog to its own domain if I achieved my goal of 15,000 visits by the end of April. It looks like I’m on track to do that. More importantly, if I grow at the rates specified in my goals, I will have done so without full feeds, and in the future I can safely disregard the notion that partial feeds will hold me back.

Are There Additional Steps I Might Take?

There are indeed more steps I could take, and may yet. If or when I move this blog to a domain of its own, I will implement more protections. For now, I am reminded of something Stephen King said when asked whether a writer just starting out needs an agent.

I don’t recall it word for word, so I will paraphrase Mr. King: You do not need an agent until you are making enough for someone to want to steal from you. When you reach that point, you will be able to afford to choose one who will not.

I am, however, thinking about posting random “This Site Sucks, It Is Run By Unscrupulous, Lazy SOBs” posts, just to see what happens. Get some fun screenshots of scrapers’ sites and let people know what is going on. Might even make a hilariously ironic blog to do so, then come back here and write a post with the screenshot…that ends back up on the scraper site.

Maybe we should have an international “Stick It To Scrapers” day in the blogosphere where everyone does just that? It’s a fun thought, at least.

Entry Filed under: Beliefs, Blogging, Blogs, Business, Culture, Entrepreneurship, Environment, Everything Else, eBusiness, eMarketing.

3 Comments

  • 1. NWI staff  |  March 29, 2007 at 8:04 pm

    Sean,

    Insightful post; you’ve raised many questions and perspectives of note.

    Our posts have shown up without permission; we’ve tried contacting the bloggers and, where appropriate, at least securing link exchanges. Zero successes so far.

    Our posts are generally cited by publication and date. For example, My Errant Mind’s piece about Renegade Motion Pictures proved the impetus for one of our favorite posts.

    http://notedwithinterest.wordpress.com/2007/03/20/noted-with-interest-renegade-motion-pictures-sells-shares-of-new-horror-movie-to-fans/

    Please continue sharing knowledge as it is greatly appreciated.

    Regards,

    Noted With Interest staff
    http://NotedWithInterest.wordpress.com/
    “Finding New Business from Open-source Intelligence”

  • 2. Sean Wilson  |  March 30, 2007 at 2:48 pm

    Thanks for stopping by and sharing your comments. It has been my experience in my brief blogging time that it is almost pointless to try and reason with scrapers. If they were reasonable or had any ethics to begin with, they probably wouldn’t be scraping your content.

    I like your blog by the way.

    :)

    Have a great day.

  • 3. rose m.  |  February 26, 2008 at 1:51 am

    These two sites have stolen content from my site!! http://www.royalidea.blogspot.com and http://www.hauntologie.blogspot.com. They took my best articles and put them on their sites and they’re saying they are the author!!! POS sites. No links, no mention that I wrote them, and they get DUGG for it!! How you like that? Damn article pirates. My site has the original articles intact and I will find a way to ban these sites.


