Tuesday, 30 December 2014

How to scrape address from Google Maps

If you want to build a new online directory website and keep it populated with the latest web content, web scraping services such as those from iWeb Scraping can help. If you want to scrape addresses from maps.google.com, there is a specialized web scraping tool developed by iWeb Scraping which can do the job for you. Web scraping has plenty of uses, including market research, gathering customer information, managing product catalogs, comparing prices, and collecting real estate data and job posting information. Web scraping technology is very popular nowadays, and it saves a lot of the time and effort involved in manually extracting data from websites.

The web scraping tools developed by iWeb Scraping are very user-friendly and can extract specific information from targeted websites. They convert data from HTML web pages into useful formats like Excel spreadsheets or an Access database. Whatever web scraping requirements you have, you can contact iWeb Scraping: they have more than 3.5 years of web data extraction experience and offer the best prices in the industry. Their services are also available on a 24x7 basis, and free pilot projects can be done on request.

Companies that require specific web data and are looking for an application which can automate the process and export the HTML data in a structured format can benefit greatly from iWeb Scraping's web scraping applications. You can easily extract data from multiple target websites, then parse and re-assemble the information from HTML into a database or spreadsheets as you wish. The application has a simple point-and-click user interface, and any beginner can use it to scrape addresses from Google Maps. If you want to gather the addresses of people in a particular region from Google Maps, you can do it with the help of the web scraping application developed by iWeb Scraping.

Web scraping is a technology that can digest target website databases that are visible only as HTML web pages, and create a local, structured replica of those databases as a result. With a web scraping and web data extraction service, you can capture web pages and then pin-point the specific pieces of data you'd like to extract from them. What is needed in this process is much more than a website crawler and a set of website wrappers. The time required for web data extraction is far lower than for a manual copy-and-paste job.
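As a rough, generic illustration of that "capture a page, then pin-point specific pieces of data" workflow (this is not iWeb Scraping's tool; the URL and CSS selector below are placeholders), a minimal Python sketch might look like this:

import csv

import requests
from bs4 import BeautifulSoup

# Placeholder values - substitute a real listing page and a selector that matches it.
URL = "https://example.com/directory"
SELECTOR = "div.listing address"

def scrape_addresses(url, selector):
    """Capture a web page, then pin-point the pieces of data we want."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Every element matching the selector is treated as one address.
    return [el.get_text(strip=True) for el in soup.select(selector)]

if __name__ == "__main__":
    addresses = scrape_addresses(URL, SELECTOR)
    # Export to a spreadsheet-friendly format (a CSV file opens in Excel).
    with open("addresses.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["address"])
        writer.writerows([a] for a in addresses)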

Source:http://www.articlesbase.com/information-technology-articles/how-to-scrape-address-from-google-maps-4683906.html


Sunday, 28 December 2014

So What Exactly Is a Private Data Scraping Service, and Why Use One?

When your computer connects to the Internet and requests information, it sends queries to different servers. When you visit a website, the site's server recognizes your computer's IP address before returning the page, and often much more. Many e-commerce sites log your IP address and browsing patterns for marketing purposes.


The scraping server you connect to sits between you and your destination, processing your requests and applying filters, for example filtering traffic by IP address or protocol. As you might guess, there are many types of scraping services, and demand for the software is high. Some can quickly send email messages to businesses and companies to help you find contacts.

Although sending requests through free scraping IP addresses can work, paid services with an automatic, plug-and-play user interface are far easier to use. Scraping services offer a variety of relevant sources of data, and scraping service organizations are generally used where large amounts of data are needed every day. They can be efficient and highly precise while also remaining affordable.

A good scraping service gives companies information on the various strategies of their market, delivers it in a planned, structured format, and does so far more rapidly than manual collection.

In addition, the application software treats flexibility as a priority. Software that can be tailored to the needs of customers and satisfy varied customer requirements plays a major role: it gives businesses the features necessary to provide the best possible experience.

If you do not yet use a private data scraping service for your Internet marketing, I suggest you start immediately. It is inexpensive but vital to your marketing. To learn how to set up a private scraping service, visit my blog for more information. Data scraping software gathers and sorts large amounts of information; in this way a company reduces costs, saves time, and gets a greater return on investment.

What happens if the steady stream of data from these sites gets stopped? Scrapers that send HTML page requests to a web server depend on the structure of the pages returned, and when that structure changes in production, they are very likely to break.

Data scraping is a common service for outsourcing companies. Many businesses increasingly outsource these services to firms that specialize in Internet-related activities, and those firms can earn a lot of money doing so.

Web data scraping services pull information from informal or semi-structured sources into a structured, planned format. They run on their own servers to execute the extraction, so IP blocking is not a problem for them: they can switch servers in minutes and get the scraping exercise back on track. Try such a service and you'll see what I mean.


Source:http://www.articlesbase.com/outsourcing-articles/so-what-exactly-is-a-private-data-scraping-services-to-use-you-5587140.html

Thursday, 25 December 2014

Limitations and Challenges in Effective Web Data Mining

Web data mining and data collection is a critical process for many business and market research firms today. Conventional Web data mining techniques involve search engines like Google, Yahoo and AOL, and keyword, directory and topic-based searches. Since the Web's existing structure cannot provide high-quality, definite and intelligent information, systematic web data mining may help you get the business intelligence and relevant data you need.

Factors that affect the effectiveness of keyword-based searches include:

• Use of general or broad keywords on search engines results in millions of web pages, many of which are totally irrelevant.

• Similar or multi-variant keyword semantics may return ambiguous results. For instance, the word "panther" could be an animal, a sports accessory or a movie name.

• It is quite possible that you may miss many highly relevant web pages that do not directly include the searched keyword.

The most important factor that prohibits deep web access is the effectiveness of search engine crawlers. Modern search engine crawlers, or bots, cannot access the entire web due to bandwidth limitations. There are thousands of internet databases that can offer high-quality, editor-scanned and well-maintained information, but they are never reached by the crawlers.

Almost all search engines have limited options for keyword query combination. For example, Google and Yahoo provide options like phrase match or exact match to limit search results. Getting the most relevant information therefore demands more effort and time. Since human behavior and choices change over time, a web page needs to be updated more frequently to reflect these trends. Also, there is limited scope for multi-dimensional web data mining, since existing information searches rely heavily on keyword-based indices, not the real data.

The above limitations and challenges have resulted in a quest to discover and use Web resources more efficiently and effectively. Send us any of your queries regarding Web data mining processes to explore the topic in more detail.

Source: http://ezinearticles.com/?Limitations-and-Challenges-in-Effective-Web-Data-Mining&id=5012994

Monday, 22 December 2014

GScholarXScraper: Hacking the GScholarScraper function with XPath

Kay Cichini recently wrote a word-cloud R function called GScholarScraper on his blog which when given a search string will scrape the associated search results returned by Google Scholar, across pages, and then produce a word-cloud visualisation.

This was of interest to me because around the same time I posted an independent Google Scholar scraper function  get_google_scholar_df() which does a similar job of the scraping part of Kay’s function using XPath (whereas he had used Regular Expressions). My function worked as follows: when given a Google Scholar URL it will extract as much information as it can from each search result on the URL webpage  into different columns of a dataframe structure.

In the comments of his blog post I figured it’d be fun to hack his function to provide an XPath alternative, GScholarXScraper. Essentially it’s still the same function he wrote and therefore full credit should go to Kay on this one as he fully deserves it – I certainly had no previous idea how to make a word cloud, plus I hadn’t used the tm package in ages (to the point where I’d forgotten most of it!). The main changes I made were as follows:

•    Restructure the internal code of GScholarScraper into a series of local functions which each do a separate job (this made it easier for me to hack because I understood what was doing what and why).

•    As far as possible, strip out Regular Expressions and replace them with XPath alternatives (made possible via the XML package). Hence the change of name to GScholarXScraper. Basically, apart from a little messing about with the generation of the URLs, I just copied over my get_google_scholar_df() function and removed the Regular Expression alternatives. I’m not saying one is better than the other, but for me personally I find XPath shorter and quicker to code; either is a good approach for web scraping like this (note to self: I really need to learn more about regular expressions!) :)

•    Vectorise a few of the loops I saw (it surprises me how second nature this has become to me – I used to find the *apply family of functions rather confusing but thankfully not so much any more!).
•    Make use of getURL from the RCurl package (I was getting some multibyte string problems originally when using readLines, but this approach automatically fixed it for me).
•    Add option to make a word-cloud from either the “title” or the “description” fields of the Google Scholar search results
•    Added stemming via the Rstem package because I couldn’t get the Snowball package to install with my version of Java. This was important to me because I was getting word clouds with variations of the same word in them, e.g. “game”, “games”, “gaming”.
•    Forced use of URLencode() on generation of URLs to automatically avoid problems with search terms like “Baldur’s Gate” which would otherwise fail.

I think that’s pretty much everything I added. Anyway, here’s how it works (link to full code at end of post):

# # EXAMPLE 1: Display word cloud based on the title field of each Google Scholar search result returned
# GScholarXScraper(search.str = "Baldur's Gate", field = "title", write.table = FALSE, stem = TRUE)
#
# # word freq
# # game game 71
# # comput comput 22
# # video video 13
# # learn learn 11
# # [TRUNC...]
# #
# #
# # Number of titles submitted = 210
# #
# # Number of results as retrieved from first webpage = 267
# #
# # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles

[word cloud image]

I think that’s kind of cool and corresponds to what I would expect for a search about the legendary Baldur’s Gate computer role playing game :)  The following is produced if we look at the ‘description’ field instead of the ‘title’ field:

# # EXAMPLE 2: Display word cloud based on the description field of each Google Scholar search result returned
GScholarXScraper(search.str = "Baldur's Gate", field = "description", write.table = FALSE, stem = TRUE)
#
# # word freq
# # page page 147
# # gate gate 132
# # game game 130
# # baldur baldur 129
# # roleplay roleplay 21
# # [TRUNC...]
# #
# # Number of titles submitted = 210
# #
# # Number of results as retrieved from first webpage = 267
# #
# # Be aware that sometimes titles in Google Scholar outputs are truncated - that is why, i.e., some mandatory intitle-search strings may not be contained in all titles

[word cloud image]

Not bad. I could see myself using the text mining and word cloud functionality with other projects I’ve been playing with such as Facebook, Google+, Yahoo search pages, Google search pages, Bing search pages… could be fun!

Many thanks again to Kay for making his code publicly available so that I could play with it and improve my programming skill set.

Code:

Full code for GScholarXScraper can be found here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/GScholarXScraper/GScholarXScraper

Original GScholarScraper code is here: https://docs.google.com/document/d/1w_7niLqTUT0hmLxMfPEB7pGiA6MXoZBy6qPsKsEe_O0/edit?hl=en_US

Full code for just the XPath scraping function is here: https://github.com/tonybreyal/Blog-Reference-Functions/blob/master/R/googleScholarXScraper/googleScholarXScraper.R

Source:http://www.r-bloggers.com/gscholarxscraper-hacking-the-gscholarscraper-function-with-xpath/

Friday, 19 December 2014

Extractions and Skin Care

As an esthetician or skin care professional, you may have heard some controversy over the matter of performing extractions during a routine facial service. What may seem like a relatively simple procedure can actually raise great controversy in the world of esthetics. Some estheticians regard extractions as a matter of providing a complete service while others see this as inflicting trauma to the skin. Learning more about both sides of the issue can help you as a professional in making an informed decision and explaining the issue to your clients.

What is an extraction?

As a basic review, an extraction is removing impurity (plug of dead skin or oil) from a pore or pimple. It is the removal of both blackheads and whiteheads from the skin. Extractions occur after the skin has been thoroughly cleansed, exfoliated and sometimes steamed to soften the area prior to extraction.

Why Do It?

Extractions are considered a "must" by many estheticians when performing a routine facial because they want to leave their clients' skin looking and feeling its best. When done correctly, a simple extraction should be quick and relatively painless. As a trained esthetician it is important to know if your client has sensitive skin, which would make them more prone to the damage that can be caused by extractions.

Why Not?

Extractions should only be performed by a trained esthetician and should not be done in excess. Extractions can cause broken capillaries or skin irritations that can lead to more (not fewer) breakouts. Extractions can also cause discomfort for your client when done incorrectly, so you should seek their permission before performing any type of extraction during their facial. Remember, your client has the right to know about any product or procedure being used on their skin and to make an informed choice.

Who Decides?

As an esthetician it may be entirely up to you or it may be a procedure within your salon to do or not do extractions. It is important to check the guidelines of your employer and know their policies before performing any procedure. Remember to explain extractions and their benefits and possible complications to your client. Trust is an important part of any relationship and your client needs to know you are being open and honest with them. The last thing you want as a professional is a reputation for inflicting unnecessary and unwanted procedures or damage to your client's skin.

Bellanina Institute's owner and director, Nina Howard, is a multi-talented, forward-thinking entrepreneur who has built the Bellanina brand from the ground up into a successful million-dollar spa, spa training business, and skin care product line. Nina is a Licensed Esthetician with Para-Medical studies, Massage Therapist, Polarity Therapist, Skin Care Educator, Artist, and Professional Interior Designer.

Source:http://ezinearticles.com/?Extractions-and-Skin-Care&id=5271715

Wednesday, 17 December 2014

Benefits of Predictive Analytics and Data Mining Services

Predictive analytics is the process of working with a variety of data and applying various mathematical formulas to discover the best decision for a given situation. Predictive analytics gives your company a competitive edge and can be used to improve ROI substantially. It is the decision science that removes the guesswork from the decision-making process and applies proven scientific guidelines to find the right solution in the shortest time possible.

Predictive analytics can be helpful in answering questions like:

•    Who is most likely to respond to your offer?
•    Who is most likely to ignore it?
•    Who is most likely to discontinue your service?
•    How much will a consumer spend on your product?
•    Which transaction is a fraud?
•    Which insurance claim is fraudulent?
•    What resources should I dedicate at a given time?

Benefits of Data mining include:

•    Better understanding of customer behavior propels better decisions
•    Profitable customers can be spotted fast and served accordingly
•    Generate more business by reaching hidden markets
•    Target your Marketing message more effectively
•    Helps in minimizing risk and improves ROI.
•    Improve profitability by detecting abnormal patterns in sales, claims, transactions etc
•    Improved customer service and confidence
•    Significant reduction in Direct Marketing expenses

Basic steps of Predictive Analytics are as follows:


•    Spot the business problem or goal
•    Explore various data sources (such as transaction history, user demographics, catalog details, etc.)
•    Extract different data patterns from the above data
•    Build a sample model based on data & problem
•    Classify data, find valuable factors, generate new variables
•    Construct a Predictive model using sample
•    Validate and Deploy this Model
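
To make these steps concrete, here is a minimal, purely illustrative Python sketch (the scikit-learn library, the file name and the column names below are assumptions, not something prescribed by this article): spot a goal (predict who will respond to an offer), build a sample model, validate it, and then deploy it to score the full contact base.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical transaction/demographic history with a known outcome column.
df = pd.read_csv("transaction_history.csv")
features = ["recency_days", "frequency", "monetary_value", "age"]  # assumed columns

# Build a sample model based on the data and the problem.
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["responded"], test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Validate the model on held-out data before deploying it.
print("Validation AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# "Deploy": score who is most likely to respond to the offer.
df["response_probability"] = model.predict_proba(df[features])[:, 1]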

Standard techniques used for it are:

•    Decision Tree
•    Multi-purpose Scaling
•    Linear Regressions
•    Logistic Regressions
•    Factor Analytics
•    Genetic Algorithms
•    Cluster Analytics
•    Product Association

Should you have any queries regarding Data Mining or Predictive Analytics applications, please feel free to contact us. We would be pleased to answer each of your queries in detail.

Source:http://ezinearticles.com/?Benefits-of-Predictive-Analytics-and-Data-Mining-Services&id=4766989

Monday, 15 December 2014

RAM Scraping a New Old Favorite For Hackers

Some of the best stories involve a conflict with an old enemy: a friend-turned-foe, long thought dead, returning from the grave for violent retribution; an ancient order of dark siders from the distant reaches of the galaxy, hiding in plain sight and waiting to seize power for themselves; a dark lord thought destroyed millennia ago, only to rise again and seek his favorite piece of jewelry.  The list goes on.

Granted, 2011 isn’t quite “millennia,” and this story isn’t meant for entertainment, but the old foe in this instance is nonetheless dangerous in its own right.  That is the year when RAM scraping malware first made major headlines: originating as an advanced version of the Trackr malware, controlled through a botnet, it was discovered in the compromised Point of Sale (POS) systems of a university and several hotels.  And while it seemed recently that this method had dwindled in popularity, the Target and other retail breaches saw it return with a vengeance.  With 110 million Target customers having their information compromised, it was easily one of the largest incidents involving memory scrapers.

How does it work?  First, the malware has to be introduced into the POS network, which can happen via any machine that is connected to the network, or unsecured wireless networks.  Even with firewalls, an infected laptop could serve as a vector.  Once installed, the malware can hide in the shadows, employing encryption or antivirus-avoiding tools to prevent its identification until it’s ready to strike.  Then, when a customer’s card gets used at a POS machine, the data contained within—name, card number, security code, etc.—gets sent to the system memory.  “There is that opportunity to steal the credit card information when it is in memory, perhaps even before your payment has even been authorized, and the data hasn't even been written to the hard drive yet,” says security researcher Graham Cluley.

So, why not encrypt the system’s memory, when it’s at its most vulnerable?  Not that simple, sadly: “No matter how strong your encryption is, if the system needs to process data or process the code, everything needs to be decrypted in memory,” Chris Elisan, principal malware scientist at security firm RSA, explained to Dark Reading.

There are certain steps a company can take, of course, and should take, to reduce the risk.  Strong passwords to access the POS machines, firewalls to isolate the POS network from the Internet, disabling remote access to POS systems, to name a few.  All the same, while these measures are vital and should be used, I don’t think, in light of recent breaches, they are sufficient.  Now, I wrote a short time ago about the impending October 2014 deadline imposed by the credit card industry, regarding the systematic switch to chipped credit card technology; adopting this standard will definitely assist in eradicating this problem.  But, until such a time when a widespread implementation of new systems comes about, always be vigilant to protect your data from attack, because what’s old is new again, and a colossal data breach is a story consumers are liable to seek financial restitution for.

Source:http://www.netlib.com/blog/application-security/RAM-Scraping-a-New-Old-Favorite-For-Hackers.asp

Saturday, 13 December 2014

Scrape it – Save it – Get it

I imagine I’m talking to a load of developers. Which is odd seeing as I’m not a developer. In fact, I decided to lose my coding virginity by riding the ScraperWiki digger! I’m a journalist interested in data as a beat so all I need to do is scrape. All my programming will be done on ScraperWiki, as such this is the only coding home I know. So if you’re new to ScraperWiki and want to make the site a scraping home-away-from-home, here are the basics for scraping, saving and downloading your data:

With these three simple steps you can take advantage of what ScraperWiki has to offer – writing, running and debugging code in an easy to use editor; collaborative coding with chat and user viewing functions; a dashboard with all your scrapers in one place; examples, cheat sheets and documentation; a huge range of libraries at your disposal; a datastore with API callback; and email alerts to let you know when your scrapers break.

So give it a go and let us know what you think!

Source:https://blog.scraperwiki.com/2011/04/scrape-it-save-it-get-it/

Thursday, 11 December 2014

Ethics in data journalism: mass data gathering – scraping, FOI and deception

Mass data gathering – scraping, FOI, deception and harm

The data journalism practice of ‘scraping’ – getting a computer to capture information from online sources – raises some ethical issues around deception and minimisation of harm. Some scrapers, for example, ‘pretend’ to be a particular web browser, or pace their scraping activity more slowly to avoid detection. But the deception is practised on another computer, not a human – so is it deception at all? And if the ‘victim’ is a computer, is there harm?

The tension here is between the ethics of virtue (“I do not deceive”) and teleological ethics (good or bad impact of actions). A scraper might include a small element of deception, but the act of scraping (as distinct from publishing the resulting information) harms no human. Most journalists can live with that.

The exception is where a scraper makes such excessive demands on a site that it impairs that site’s performance (because it is repetitively requesting so many pages in a small space of time). This not only negatively impacts on the experience of users of the site, but consequently the site’s publishers too (in many cases sites will block sources of heavy demand, breaking the scraper anyway).

Although the harm may be justified against a wider ‘public good’, it is unnecessary: a well designed scraper should not make such excessive demands, nor should it draw attention to itself by doing so. The person writing such a scraper should ensure that it does not run more often than is necessary, or that it runs more slowly to spread the demands on the site being scraped. Notably in this regard, ProPublica’s scraping project Upton “helps you be a good citizen [by avoiding] hitting the site you’re scraping with requests that are unnecessary because you’ve already downloaded a certain page” (Merrill, 2013).
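
As a rough sketch of what "not running more often than necessary" and "running more slowly" can look like in practice (the URLs and cache directory below are placeholders, and this is not Upton itself), a scraper might cache pages it has already downloaded and pause between requests:

import os
import time

import requests

CACHE_DIR = "cache"           # local cache of pages already downloaded
DELAY_SECONDS = 5             # pause between requests to spread the load on the site
urls = ["https://example.org/page1", "https://example.org/page2"]   # placeholder URLs

os.makedirs(CACHE_DIR, exist_ok=True)

for i, url in enumerate(urls):
    cache_path = os.path.join(CACHE_DIR, "page%d.html" % i)
    if os.path.exists(cache_path):
        continue   # already downloaded: no need to hit the site again
    response = requests.get(url, timeout=30)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    time.sleep(DELAY_SECONDS)   # pace the scraper so it does not impair the site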

Attempts to minimise that load can itself generate ethical concerns. The creator of seminal data journalism projects chicagocrime.org and Everyblock, Adrian Holovaty, addresses some of these in his series on ‘Sane data updates’ and urges being upfront about

    “which parts of the data might be out of date, how often it’s updated, which bits of the data are updated … and any other peculiarities about your process … Any application that repurposes data from another source has an obligation to explain how it gets the data … The more transparent you are about it, the better.” (Holovaty, 2013)

Publishing scraped data in full does raise legal issues around the copyright and database rights surrounding that information. The journalist should decide whether the story can be told accurately without publishing the full data.

Issues raised by scraping can also be applied to analogous methods using simple email technology, such as the mass-generation of Freedom of Information requests. Sending the same FOI request to dozens or hundreds of authorities results in a significant pressure on, and cost to, public authorities, so the public interest of the question must justify that, rather than its value as a story alone. Journalists must also check the information is not accessible through other means before embarking on a mass-email.

Source: http://onlinejournalismblog.com/2013/09/18/ethics-in-data-journalism-mass-data-gathering-scraping-foi-and-deception/

Monday, 8 December 2014

The Hubcast #4: A Guide to Boston, Scraping Local Leads, & Designers.Hubspot.com

The Hubcast Podcast Episode 004

Welcome back to The Hubcast folks! As mentioned last week, this will be a weekly podcast all about HubSpot news, tips, and tricks. Please also note the extensive show notes below including some new HubSpot video tutorials created by George Thomas.

Show Notes:

Inbound 2014

THE INSIDER’S GUIDE TO BOSTON

Boston Guide


On September 15-18, the Boston Convention & Exhibition Center will be filled with sales and marketing professionals for INBOUND 2014. Whether this will be your first time visiting Boston, you’ve visited Boston in the past, or you’ve lived in the city for years, The Insider’s Guide to Boston is your go-to guide for enjoying everything the city has to offer. Click on a persona below to get started.

Are you The Brewmaster – The Workaholic – The Chillaxer?

Check out the guide here

HubSpot Tips & Tricks

Prospects Tool – Scrape Local Leads
Prospects Tool


This week's tip/trick is how to silence some of the noise in your Prospects tool. Sometimes you might need to look at just local leads for calls or drop-offs. We show you how to do that and much more with the HubSpot Prospects Tool.

Watch the tutorial here

HubSpot Strategy
Crack down on your site's copy.


We talk about how your home page and about pages are talking to your potential customers in all the wrong ways. Are you the me, me, me person at the digital party? Or are you letting people know how their problems can be solved by your products or services?

HubSpot Updates
(Each week on the Hubcast, George and Marcus will be looking at HubSpot’s newest updates to their software. And in this particular episode, we’ll be discussing 2 of their newest updates)
Default Contact Properties

You can now choose a default option on contact properties that sets a default value for that property that can be applied across your entire contacts database. When creating or editing a new contact property in Contacts Settings, you’ll see a new default option next to the labels on properties with field types “Dropdown,” “Radio Select” and “Single On/Off Checkbox”.

Default Contact Properties

When you set a contact property as “default”, all contacts who don't have any value set for this property will adopt the default value you've selected. In the example above, we're creating a property to track whether your contact uses a new feature. Initially, all of them would be “No,” and that's the default value that will be applied database-wide. As a result, it will get stamped on each contact record where the value wasn't previously present.

Now, when you want to apply a contact property across multiple contacts, you don’t have to create a list of those contacts and then create a workflow that stamps that contact property across those contacts. This new feature allows you to bypass those steps by using the “default” option on new contact properties you create.

Watch the tutorial here
RSS Module with Images

Now available is a new option within modules in the template builder that will allow you to easily add a featured image to an RSS module. This module will show a blog post’s featured image next to the feed of recent blog content. If you are a marketer, all you need to do is simply check the “Featured Image” box off in the RSS Listing module to display a list of recent COS blog posts with images on any page. No developers or code necessary to do this!

If you are a designer and want to add additional styling to an RSS module with images, you can do so using HubL tokens.

Here is documentation on how to get started.

Watch the tutorial here

HubSpot Wishlist

 The HubSpot Keywords Tool


Why oh why, HubSpot, can we only have 1,000 keywords in our Keywords tool? We talk about how, for many companies, 1,000 keywords just don't cut it. For example, Yale Appliance can easily blow through those keywords.

Source: http://www.thesaleslion.com/hubcast-podcast-004/

Monday, 1 December 2014

Web Scraping’s 2013 Review – part 2

As promised, we are back with the second part of this year's web scraping review. Today we will focus not only on the events of 2013 that involved web scraping but also on Big Data and what this year meant for that concept.

First of all, we could not talk about the conferences in which data mining was involved without mentioning the TED conferences. This year the speakers focused on the power of data analysis to help medicine and to prevent possible crises in third world countries. Regarding data mining, everyone agreed that this is one of the best ways to obtain virtual data.

Also, a study by MeriTalk, a government IT networking group, commissioned by NetApp, showed this year that companies are not prepared for the informational revolution. The survey found that state and local IT pros are struggling to keep up with data demands. Just 59% of state and local agencies are analyzing the data they collect and less than half are using it to make strategic decisions. State and local agencies estimate that they have just 46% of the data storage and access, 42% of the computing power, and 35% of the personnel they need to successfully leverage large data sets.

Some economists argue that it is often difficult to estimate the true value of new technologies, and that Big Data may already be delivering benefits that are uncounted in official economic statistics. Cat videos and television programs on Hulu, for example, produce pleasure for Web surfers — so shouldn’t economists find a way to value such intangible activity, whether or not it moves the needle of the gross domestic product?

We will end this article with some numbers about the staggering growth of data available on the internet. There were 30 billion gigabytes of video, e-mails, Web transactions and business-to-business analytics in 2005. The total is expected to reach more than 20 times that figure in 2013 (upwards of 600 billion gigabytes), with off-the-charts increases to follow in the years ahead, according to research conducted by Cisco. So, as you can see, we have good reason to believe that 2014 will be at least as good as 2013.

Source:http://thewebminer.com/blog/2013/12/

Friday, 28 November 2014

Scraping R-bloggers with Python – Part 2

In my previous post I showed how to write a small simple python script to download the pages of R-bloggers.com. If you followed that post and ran the script, you should have a folder on your hard drive with 2409 .html files labeled post1.html , post2.html and so forth. The next step is to write a small script that extract the information we want from each page, and store that information in a .csv file that is easily read by R. In this post I will show how to extract the post title, author name and date of a given post and store it in a .csv file with a unique id.

To do this open a document in your favorite python editor (I like to use aquamacs) and name it: extraction.py. As in the previous post we start by importing the modules that we will use for the extraction:

from BeautifulSoup import BeautifulSoup

import os
import re

As in the previous post we will be using the BeautifulSoup module to extract the relevant information from the pages. The os module is used to get a list of files from the directory where we have saved the .html files, and finally the re module allows us to use regular expressions to reformat any titles that include a comma or a newline (\n). We need to remove these as they would mess up the formatting of the .csv file.

After having read in the modules, we need to get a list of files that we can iterate over. First we need to specify the path where the files are saved, and then we use the os module to get all the filenames in the specified directory:

path = "/Users/thomasjensen/Documents/RBloggersScrape/download"

listing = os.listdir(path)

It might be that there are other files in the given directory, hence we apply a filter, in the shape of a list comprehension, to weed out any file names that do not match our naming scheme:

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]

Notice that a regular expression was used to determine whether a given name in the list matched our naming scheme. For more on regular expressions have a look at this site.

The final steps in preparing our extraction are to change the working directory to where we have our .html files, and to create an empty dictionary:

os.chdir(path)
data = {}

Dictionaries are one of the great features of Python. Essentially a dictionary is a mapping of a key to a specific value, however the fact that dictionaries can be nested within each other, allows us to create data structures similar to R’s data frames.

Now we are ready to begin extracting information from our downloaded pages. Much as in the previous post, we will loop over all the file names, read each file into Python and create a BeautifulSoup object from the file:

for page in listing:
    site = open(page,"rb")
    soup = BeautifulSoup(site)

In order to store the values we extract from a given page, we update the dictionary with a unique key for the page. Since our naming scheme made sure that each file had a unique name, we simply remove the .html part from the page name, and use that as our key:

key = re.sub(".html","",page)

data.update({key:{}})

This will create a mapping between our key and an empty dictionary, nested within the data dictionary. Once this is done we can start extracting information and storing it in our newly created nested dictionary. The content we want is located in the main column, which has the id tag “leftcontent” in the html code. To get at this we use the find() function on the soup object created above:

content = soup.find("div", id = "leftcontent")

The first “h1” tag in our content object contains the title, so again we will use the find() function on the content object, to find the first “h1” tag:

title = content.findNext("h1").text

To get the text within the “h1” tag, .text is appended to our search within the content object.

To find the author name, we are lucky that there is a class of “div” tags called “meta” which contain a link with the author name in it. To get the author name we simply find the meta div class and search for a link. Then we pull out the text of the link tag:

author = content.find("div",{"class":"meta"}).findNext("a").text

Getting the date is a simple matter, as it is nested within a div tag with the class “date”:

date = content.find("div",{"class":"date"}).text

Once we have the three variables we put them in dictionaries that are nested within the nested dictionary we created with the key:

data[key]["title"] = title
data[key]["author"] = author
data[key]["date"] = date

Once we have run the loop and gone through all the posts, we need to write them in the right format to a .csv file. To begin with we open a .csv file named output.csv:

output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

then we create a header that contains the variable names and write it to the output.csv file as the first row:

variables = unicode(",".join(["id","date","author","title"]))
header = variables + "\n"
output.write(header.encode("utf8"))

Next we pull out all the unique keys from our dictionary that represent individual posts:

keys = data.keys()

Now it is a simple matter of looping through all the keys, pull out the information associated with each key, and write that information to the output.csv file:

for key in keys:
    print key
    id = key
    date = re.sub(",","",data[key]["date"])
    author = data[key]["author"]
    title = re.sub(",","",data[key]["title"])
    title = re.sub("\\n","",title)
    linelist = [id,date,author,title]
    linestring = unicode(",".join(linelist))
    linestring = linestring + "\n"
    output.write(linestring.encode("utf-8"))

Notice that we first create four variables that contain the id, date, author and title information. With regards to the title we use two regular expressions to remove any commas and “\n” from the title, as these would create new columns or new line breaks in the output.csv file. Finally we put the variables together in a list, and turn the list into a string with the list items separated by a comma. Then a linebreak is added to the end of the string, and the string is written to the output.csv file. As a last step we close the file connection:

output.close()

And that is it. If you followed the steps you should now have a csv file in your directory with 2409 rows, and four variables – ready to be read into R. Stay tuned for the next post which will show how we can use this data to see how R-bloggers has developed since 2005. The full extraction script is shown below:

from BeautifulSoup import BeautifulSoup

import os
import re

path = "/Users/thomasjensen/Documents/RBloggersScrape/download"
listing = os.listdir(path)

listing = [name for name in listing if re.search(r"post\d+\.html",name) != None]

os.chdir(path)

data = {}

for page in listing:
    site = open(page,"rb")
    soup = BeautifulSoup(site)
    key = re.sub(".html","",page)
    print key
    data.update({key:{}})
    content = soup.find("div", id = "leftcontent")
    title = content.findNext("h1").text
    author = content.find("div",{"class":"meta"}).findNext("a").text
    date = content.find("div",{"class":"date"}).text
    data[key]["title"] = title
    data[key]["author"] = author
    data[key]["date"] = date

output = open("/Users/thomasjensen/Documents/RBloggersScrape/output.csv","wb")

keys = data.keys()

variables = unicode(",".join(["id","date","author","title"]))
header = variables + "\n"
output.write(header.encode("utf8"))

for key in keys:
    print key
    id = key
    date = re.sub(",","",data[key]["date"])
    author = data[key]["author"]
    title = re.sub(",","",data[key]["title"])
    title = re.sub("\\n","",title)
    linelist = [id,date,author,title]
    linestring = unicode(",".join(linelist))
    linestring = linestring + "\n"
    output.write(linestring.encode("utf-8"))

output.close()

Source:http://www.r-bloggers.com/scraping-r-bloggers-with-python-part-2/

Wednesday, 26 November 2014

Data Mining and Frequent Datasets

I've been doing some work for my exams in a few days and I'm going through some past papers but unfortunately there are no corresponding answers. I've answered the question and I was wondering if someone could tell me if I am correct.

My question is

    (c) A transactional dataset, T, is given below:
    t1: Milk, Chicken, Beer
    t2: Chicken, Cheese
    t3: Cheese, Boots
    t4: Cheese, Chicken, Beer,
    t5: Chicken, Beer, Clothes, Cheese, Milk
    t6: Clothes, Beer, Milk
    t7: Beer, Milk, Clothes

    Assume that minimum support is 0.5 (minsup = 0.5).

    (i) Find all frequent itemsets.

Here is how I worked it out:

    Item : Amount
    Milk : 4
    Chicken : 4
    Beer : 5
    Cheese : 4
    Boots : 1
    Clothes : 3

Now because the minsup is 0.5 you eliminate boots and clothes and make a combo of the remaining giving:

    {items} : Amount
    {Milk, Chicken} : 2
    {Milk, Beer} : 4
    {Milk, Cheese} : 1
    {Chicken, Beer} : 3
    {Chicken, Cheese} : 3
    {Beer, Cheese} : 2

Which leaves milk and beer as the only frequent item set then as it is the only one above the minsup?


Nanor

3 Answers

There are two ways to solve the problem:

    using Apriori algorithm
    Using FP counting

Assuming that you are using Apriori, the answer you got is correct.

The algorithm is simple:

First you count frequent 1-item sets and exclude the item-sets below minimum support.

Then count frequent 2-item sets by combining frequent items from previous iteration and exclude the item-sets below support threshold.

The algorithm can go on until no item-sets are greater than threshold.

In the problem given to you, you only get 1 set of 2 items greater than threshold so you can't move further.

There is a solved example of further steps on Wikipedia here.

You can refer "Data Mining Concepts and Techniques" by Han and Kamber for more examples.


There are more than two algorithms to solve this problem. I will just mention a few of them: Apriori, FPGrowth, Eclat, HMine, DCI, Relim, AIM, etc. –  Phil Mar 5 '13 at 7:18

OK, to start, you must first understand that data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information - information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

Now, the amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) For instance, every day, Wal-Mart uploads 20 million point-of-sale transactions to an A&T massively parallel system with 483 processors running a centralized database.

Raw data by itself, however, does not provide much information. In today's fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing, investment, and management strategies.

Now you must understand that association rule mining is an important model in data mining. Its mining algorithms discover all item associations (or rules) in the data that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf) constraints. Minsup controls the minimum number of data cases that a rule must cover. Minconf controls the predictive strength of the rule.

Since only one minsup is used for the whole database, the model implicitly assumes that all items in the data are of the same nature and/or have similar frequencies in the data. This is, however, seldom the case in real- life applications. In many applications, some items appear very frequently in the data, while others rarely appear. If minsup is set too high, those rules that involve rare items will not be found. To find rules that involve both frequent and rare items, minsup has to be set very low.

This may cause combinatorial explosion because those frequent items will be associated with one another in all possible ways. This dilemma is called the rare item problem. This paper proposes a novel technique to solve this problem. The technique allows the user to specify multiple minimum supports to reflect the natures of the items and their varied frequencies in the database. In rule mining, different rules may need to satisfy different minimum supports depending on what items are in the rules.

Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf).

I hope that once you understand the very basics of data mining that the answer to this question shall become apparent.


The Apriori algorithm is based on the idea that for a pair of items to be frequent, each individual item should also be frequent. If the hamburger-ketchup pair is frequent, the hamburger itself must also appear frequently in the baskets. The same can be said about the ketchup.

So for the algorithm, a "threshold X" is established to define what is or is not frequent. If an item appears more than X times, it is considered frequent.

The first step of the algorithm is to pass over each item in each basket and calculate its frequency (count how many times it appears). This can be done with a hash of size N, where position y of the hash refers to the frequency of item y.

If item y has a frequency greater than X, it is said to be frequent.

In the second step of the algorithm, we iterate through the items again, computing the frequency of pairs in the baskets. The catch is that we compute this only for items that are individually frequent. So if item y and item z are frequent on their own, we then compute the frequency of the pair. This condition greatly reduces the number of pairs to compute, and the amount of memory taken.

Once this is calculated, the pairs with frequencies greater than the threshold are said to be frequent itemsets.
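
As a rough illustration of the two-pass counting described above (this code is not from the original thread), here is a small Python sketch applied to the transactional dataset in the question; with minsup = 0.5 it reproduces the asker's result that {Milk, Beer} is the only frequent pair:

from itertools import combinations

transactions = [
    {"Milk", "Chicken", "Beer"},
    {"Chicken", "Cheese"},
    {"Cheese", "Boots"},
    {"Cheese", "Chicken", "Beer"},
    {"Chicken", "Beer", "Clothes", "Cheese", "Milk"},
    {"Clothes", "Beer", "Milk"},
    {"Beer", "Milk", "Clothes"},
]
minsup = 0.5
threshold = minsup * len(transactions)  # an itemset must appear in at least 3.5 baskets

# Pass 1: count individual items and keep only the frequent ones.
item_counts = {}
for basket in transactions:
    for item in basket:
        item_counts[item] = item_counts.get(item, 0) + 1
frequent_items = {item for item, count in item_counts.items() if count >= threshold}

# Pass 2: count pairs built only from individually frequent items (the Apriori idea).
pair_counts = {}
for basket in transactions:
    for pair in combinations(sorted(basket & frequent_items), 2):
        pair_counts[pair] = pair_counts.get(pair, 0) + 1
frequent_pairs = {pair: count for pair, count in pair_counts.items() if count >= threshold}

print(frequent_items)   # e.g. {'Beer', 'Cheese', 'Chicken', 'Milk'}
print(frequent_pairs)   # {('Beer', 'Milk'): 4} - the same answer as in the question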

Source: http://stackoverflow.com/questions/14164853/data-mining-and-frequent-datasets?rq=1

Sunday, 23 November 2014

4 Data Mining Tips to Scrape Real Estate Data; an Innovative Way to Give the Realty Business a Boost!

The Internet has become a huge source of data – in fact, it has turned into a goldmine for marketers, from which they can easily dig out useful data!

    Web scraping has become a norm in today’s competitive era, where the one with the most, and most relevant, information wins the race!

Real Estate Data Extraction and Scraping Service

It has helped many industries to carve a niche in the market, especially real estate. Scraping real estate data has been of great help for professionals looking to reach out to a large number of people and gather reliable property data. However, there are some people for whom web scraping is still an alien concept, most probably because most of its advantages are not discussed.

    There are institutions, companies and organizations, entrepreneurs, as well as just normal citizens generating an extraordinary amount of information every day. Property information extraction can be effectively used to get an idea about the customer psyche and even generate valuable lead to further the business.

In addition to this, data mining also has the following uses, making it an indispensable part of marketing.

Gather Properties Details from Different Geographical Locations

You are an estate agent and want to expand your business to a neighboring city or state, but you are short of information. You are completely aware of the properties in your town and its vicinity; data mining services, however, will help you get an idea about the properties in the other state. You can also approach probable clients and increase your database to offer extensive services.

Online Offers and Discounts are just a Click Away

Now, it is tough to deal with clients, show them the property of their choice and also act as a mediator between the buyer and seller. In all this, it becomes difficult to keep track of special discounts or offers. With data mining services, you can get an insight into these amazing offers. Thus, you can plan a move or even provide your client with an amazing deal.

What people are talking about – Easy Monitoring of your Online Reputation

The Internet has become a melting pot where different people come together. In fact, it provides a huge platform where people discuss their likes and dislikes. When you dig into such online forums, you can get an idea of the reputation that you or your firm holds. You can learn what people think about you, where you need to buck up, and where you need to slow down.

A Chance to Know your Competitors Better!

Last, but not least, you can keep an eye on the competition. Real estate is getting more competitive, and therefore it is important to have knowledge about your competitors to gain the upper hand. It will help you plan your moves and strategize with more ease. Moreover, you also learn what that “something” is that you have and your competitor does not, which can be subtly highlighted.

Property information extraction can prove to be the most fruitful method to get a cutting edge in the industry.

Source: http://www.hitechbposervices.com/blog/4-data-mining-tips-to-scrap-real-estate-data-innovative-way-to-give-realty-business-a-boost/

Wednesday, 19 November 2014

Web Scraping for SEO with these Open-Source Scrapers

When conducting Search Engine Optimization (SEO), we’re required to scrape websites for data for our campaigns and reports for our clients. At the lowest level we utilize scraping to keep track of rankings on search engines like Google, Bing, and Yahoo, and even to keep track of links on websites to know when they’ve completed their lifespan. We’ve also used scrapers to help us aggregate data from APIs, RSS feeds, and websites to conduct some of our data mining to find patterns that help us become more competitive.

So scraping is a function the majority of companies (SEOmoz, Raventools, and Google) have to perform, whether to save money, protect intellectual property, track trends, etc. Businesses can find infinite uses for scraping tools; it just depends whether you’re a printed circuit board manufacturer looking for ideas for your e-mail marketing campaign or an Orange County based business trying to keep an eye on the competition. That is why we’ve created a comprehensive list of the open source scrapers out there to help all the businesses out there. Just keep in mind we haven’t used all of them!

A word of caution: web scrapers require knowledge specific to the language, such as PHP and cURL. Take into consideration issues like cookie management, fault tolerance, organizing the data properly, not crashing the website being scraped, and making sure the website doesn’t prohibit scraping.
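
To make a couple of those concerns concrete, here is a minimal sketch in Python rather than PHP/cURL (the target URL is a placeholder, and it is not tied to any of the projects listed below): a session object handles cookie management, and a simple retry loop adds basic fault tolerance.

import time

import requests

URL = "https://example.com/serp"   # placeholder target page

session = requests.Session()       # cookie management: cookies persist across requests
session.headers["User-Agent"] = "my-seo-scraper/0.1"

def fetch(url, retries=3, delay=5):
    """Fetch a page with basic fault tolerance: retry a few times, then give up."""
    for attempt in range(retries):
        try:
            response = session.get(url, timeout=30)
            response.raise_for_status()
            return response.text
        except requests.RequestException:
            time.sleep(delay)      # back off so we don't hammer a struggling site
    return None

html = fetch(URL)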

If you’re ready, here’s the list…

Erlang

    eBot

Java

    Heritrix
    Nutch
    Piggy Bank
    WebSPHINX
    WebHarvest

PHP

    PHPCrawl
    Snoopy
    SpiderMonkey

Python

    BeautifulSoup
    HarvestMan
    Scrape.py
    Scrapemark
    Scrapy **
    Mechanize

Ruby

    Anemone
    scRUBYt

We’ll come back and update this list as we encounter more! If you would like to submit a solution we missed, feel free. Also, we’re looking for guides related to each of these, so if you know of any or would be interested in guest blogging about one, let us know!

Source:http://www.annexcore.com/blog/web-scraping-for-seo-with-these-open-source-scrapers/

NHL ending dry scraping of ice before overtime

TORONTO (AP) — The NHL will no longer dry scrape the ice before overtime.

Instituted this season in an effort to reduce the number of shootouts, the dry scraping will stop after Friday's games.

The general managers decided at their meeting Tuesday to make the change after the league talked to the players' union the past few days.

Beginning Saturday, ice crews around the league will again shovel the ice after regulation as they did in previous years. The GMs said the dry scrape was causing too much of a delay. Director of hockey operations Colin Campbell said the delays were lasting from more than four minutes to almost seven.

The dry scrape initially had been approved in hopes of reducing shootouts by improving scoring chances without unduly slowing play by recoating the ice.

The GMs also discussed expanded video review, including goaltender interference, and the possibility of three-on-three overtime. The American Hockey League is experimenting with the three-on-three format this season.

This annual meeting the day after the Hockey Hall of Fame induction usually doesn't produce actual changes, with the dry scrape providing an exception.

The main purpose is to set up the March meeting in Boca Raton, Florida, where these items will be further addressed.

Source:http://missoulian.com/sports/hockey/nhl-ending-dry-scraping-of-ice-before-overtime/article_3dd5473c-6102-5800-99f7-2c98be0f99ad.html

Monday, 17 November 2014

Scraping websites using the Scraper extension for Chrome

If you are using Google Chrome, there is a browser extension for scraping web pages. It's called "Scraper" and it is easy to use. It will help you scrape a website's content and upload the results to Google Docs.

Walkthrough: Scraping a website with the Scraper extension
  •     Open Google Chrome and click on Chrome Web Store
  •     Search for “Scraper” in extensions
  •     The first search result is the “Scraper” extension
  •     Click the "Add to Chrome" button.
  •     Now let's go back to the listing of UK MPs
  •     Open http://www.parliament.uk/mps-lords-and-offices/mps/
  •     Now mark the entry for one MP
  •     Screenshot: http://farm9.staticflickr.com/8490/8264509932_6cc8802992_o_d.png
  •     Right-click and select "Scrape similar…"
  •     Screenshot: http://farm9.staticflickr.com/8200/8264509972_f3a9e5d8e8_o_d.png
  •     A new window will appear – the scraper console
  •     Screenshot: http://farm9.staticflickr.com/8073/8263440961_9b94e63d56_b_d.jpg
  •     In the scraper console you will see the scraped content
  •     Click on “Save to Google Docs…” to save the scraped content as a Google Spreadsheet.
Walkthrough: extended scraping with the Scraper extension

Note: Before beginning this recipe – you may find it useful to understand a bit about HTML. Read our HTML primer.

Easy, wasn't it? Now let's do something a little more complicated. Let's say we're interested in the roles a specific actress has played. The source for all kinds of data on this is IMDB. (You could also search sites like DBpedia or Freebase for this kind of information; however, we'll stick to IMDB to show the principle.)

    Let's say we're interested in creating a timeline of all the movies the Italian actress Asia Argento ever starred in; where do we start?

    The IMDB has a quite comprehensive archive of actors. Asia Argento’s site is: http://www.imdb.com/name/nm0000782/

    If you open the page you'll see all the roles she ever played, together with the title and the year – let's scrape this information.

    Try to scrape it like we did above

    You’ll see the list comes out garbled – this is because the list here is structured quite differently.

    Go to the scraper console. Notice the small box on the upper left, saying XPath?

    XPath is a query language for HTML and XML.

    XPath can help you find the elements in the page you’re interested in – all you need to do is find the right element and then write the xpath for it.

    Now let’s assemble our table.

    You'll see that our current XPath – the one selecting all of the information – is "//div[3]/div[3]/div[2]/div"

    http://farm9.staticflickr.com/8344/8264510130_ae31697fde_o_d.png

    This XPath is quite simple: it tells the computer to look at the HTML document, select the third <div> element, then within that the third <div>, then the second <div>, and finally all <div> elements inside it (which, if you count down our list, is exactly where we are right now).
  •     However, we'd like to have the data separated out.
  •     To do this, use the columns part of the scraper console…
  •     Let's find our title first – look at the title using Inspect Element
  •     Screenshot: http://farm9.staticflickr.com/8355/8263441157_b4672d01b2_o_d.png
  •     See how the title is within a <b> tag? Let's add the tag to our XPath.
  •     The expression seems to work well; let's make this our first column
  •     In the "Columns" section, change the name of the first column to "title"
  •     Now let's add the XPath for the title to it
  •     The XPaths in the columns section are relative; that means "./b" will select the <b> element inside each selected row (a small programmatic sketch of these relative XPaths follows after this list)
  •     Add "./b" to the XPath for the title column and click "Scrape"
  •     Screenshot: http://farm9.staticflickr.com/8357/8263441315_42d6a8745d_o_d.png
  •     See how you only get titles?
  •     Now let's do the same for the year. Years are within a <span> element
  •     Create a new column by clicking on the small plus next to your "title" column
  •     Now create the "year" column with XPath "./span"
  •     Screenshot: http://farm9.staticflickr.com/8347/8263441355_89f4315a78_o_d.png
  •     Click on "Scrape" and see how the year is added
  •     See how easily we got information out of a less structured webpage?
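
As promised above, here is a small standalone Java sketch (not part of the original recipe, which is entirely point-and-click) showing the same idea of a row-selecting XPath plus relative XPaths for the columns. The markup is a tiny, made-up stand-in for one filmography row, not IMDB's actual structure:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class RelativeXPathDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical, well-formed stand-in for one "role" row: a title in <b>, a year in <span>.
        String xml = "<div><b>Some Movie Title</b><span>1999</span></div>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

        XPath xpath = XPathFactory.newInstance().newXPath();

        // The "row" XPath selects the element holding one entry (in the Scraper console this
        // was the long //div[3]/div[3]/div[2]/div expression).
        Node row = (Node) xpath.evaluate("/div", doc, XPathConstants.NODE);

        // The column XPaths are evaluated relative to that row, just like "./b" and "./span"
        // typed into the Columns section of the Scraper console.
        String title = xpath.evaluate("./b", row);
        String year  = xpath.evaluate("./span", row);

        System.out.println(title + " (" + year + ")");
    }
}

Real pages are rarely well-formed XML, so the extension does more cleanup than this, but the relative-XPath mechanics are the same.
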
Source: http://schoolofdata.org/handbook/recipes/scraper-extension-for-chrome/

Sunday, 16 November 2014

Building Java Object Graph with Tour de France results – using screen scraping, java.util.Parser and assorted facilities

Last Saturday, the 2011 Tour de France got under way. For people like myself, enjoying sports and working on data visualizations on the one hand and far-fetched uses of SQL on the other, the Tour de France offers a wealth of data to work with: rankings for each stage in various categories, nationalities and teams to group by, distances and velocity, years to compare with one another, and the like. So it has been my intention for some time to get hold of that data in a format I could work with.

Today I finally found some time to get it done. To locate the statistics for the Tour de France editions for the last few years and get them onto my laptop and into my database. This article describes the first part of that journey: how to get the stage results from some source on the internet into my locally running Java program in an appropriate object structure.

My starting point is the official Tour de France website:

Image

This website goes back to 2007 and also has the latest (2011) results. It presents the result in a format pleasing to the human eye – based on an HTML structure that is fairly pleasing to my groping Java code as well.

Analyzing the source of the Tour de France data

I start my explorations in Firefox, using the Firebug plugin. When I select the tab with the results for a particular stage, I inspect the (AJAX) call that is made to retrieve the stage results into the browser:

Image

The URL that was accessed is www.letour.fr/2010/TDF/LIVE/us/700/classement/ITE.html. When I access that URL directly, I see an HTML fragment with the individual ranking for the 7th stage in 2010. It turns out that with ITG instead of ITE in this URL, I get the overall ranking after the 7th stage. Using IME instead of ITE, I get the 7th stage's climbers' standing. And so on.

The HTML associated with the stage standing looks like this:

Image

Which is not as user friendly as the corresponding display in the browser:

Image

but still fairly well structured and programmatically interpretable.

Retrieving HTML fragments and parsing in Java

Consuming these HTML fragments with stage standings into my own Java code is very easy. Parsing the data and turning it into sensible Java Objects is slightly more work, but still quite feasible. From the Java Objects I next need to create a persistent storage for the data – that is the subject for another article.

Using the Java URL class and its openStream method to open an InputStream on whatever content can be found at the URL, it is dead easy to start reading the HTML from the Tour de France website into my Java program. I make use of the java.util.Scanner class to work my way through the HTML by Table Row (TR element). When you inspect the HTML fragments, it is clear early on that every individual rider’s entry corresponds with a TR element, so it seems only logical to have the Scanner break up the data by TR.

// Note: regexDistance is assumed to be a static java.util.regex.Pattern defined elsewhere in the
// class, used to find the stage distance (something like "165.5 km") in the header row of the fragment.
private static Stage processStage(int year, int stageSequence, Map<Integer, Rider> riders)
        throws java.io.IOException, java.net.MalformedURLException {

    String typeOfStanding = "ITE";
    URL stageStanding = new URL("http://www.letour.fr/" + year + "/TDF/LIVE/us/"
                                + (stageSequence == 0 ? "0" : stageSequence + "00")
                                + "/classement/" + typeOfStanding + ".html");
    InputStream stream = stageStanding.openStream();
    Scanner scanner = new Scanner(stream);
    scanner.useDelimiter("</tr>"); // one token per table row
    Stage stage = new Stage();
    stage.setSequence(stageSequence);
    boolean first = true;
    boolean firstStanding = true;
    while (scanner.hasNext()) {
        String entry = scanner.next();
        if (first) {
            first = false;
            // The header row contains the total distance of the stage.
            Matcher regexMatcher = regexDistance.matcher(entry);
            if (regexMatcher.find()) {
                String distanceString = regexMatcher.group();
                stage.setTotalDistance(Float.parseFloat(distanceString.substring(0, distanceString.length() - 3)));
            }
        }
        if (!first) {
            String[] els = entry.split("/td>");
            if (els.length > 1) { // only the standing entries have more than one td element
                Integer riderNumber = Integer.parseInt(extractValue(els[2]));

                Rider rider = null;
                if (riders.containsKey(riderNumber)) {
                    rider = riders.get(riderNumber);
                } else {
                    rider = new Rider(extractValue(els[1]), riderNumber, extractValue(els[3]));
                    riders.put(riderNumber, rider);
                }
                Standing standing =
                    new Standing(firstStanding ? 1 : Integer.parseInt(extractValue(els[0]).replace(".", "")),
                                 rider, extractValue(els[4]), extractValue(els[5]));
                firstStanding = false;
                stage.getStandings().add(standing);
            }
        }
    } // while
    scanner.close();
    return stage;
}

Subsequently, the TR elements need to be broken up into the TD cell elements that contain the rank, the rider's name, their number, the team they ride for, and the time for the stage as well as their lag with regard to the winner. I have used a simple split (on /td>) to extract the cells. The final logic for pulling the correct value from a cell is in the method extractValue. Note: this code is not very pretty, and I am not necessarily overly proud of it. On the other hand: it is one-time-use-only code and it is still fairly compact and easy to write and read.

// Strips the surrounding markup from a single table cell and returns its inner text.
private static String extractValue(String el) {
    String r = el.split("</")[0];
    if (r.lastIndexOf(">") > 0) {
        r = r.substring(r.lastIndexOf(">") + 1);
    }
    return r.split("<")[0];
}

I have created a few domain classes: Rider, Stage, Standing (as well as Tour) that are a business domain like representation of the Tour de France result data. Objects based on these classes are instantiated in the processStage method that is being invoked from the processTour method.

public static void processTour(Tour tour) throws IOException, MalformedURLException {
    if (tour.isPrologue())
      tour.getStages().add(processStage(tour.getYear(),0, tour.getRiders()));

    for (int i=1;i<= tour.getNumberOfStages();i++)  {
        tour.getStages().add(processStage(tour.getYear(),i, tour.getRiders()));
    }
}
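
The article doesn't list the domain classes themselves, so here is a hypothetical sketch of what Rider, Standing, Stage and Tour could look like; the fields and accessors are inferred from how processStage, processTour and TourManager use them, not taken from the original code.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Rider {
    private final String name;
    private final Integer number;
    private final String team;
    Rider(String name, Integer number, String team) {
        this.name = name; this.number = number; this.team = team;
    }
    public String getName() { return name; }
}

class Standing {
    private final int rank;
    private final Rider rider;
    private final String time; // time for the stage or overall standing
    private final String gap;  // lag with regard to the winner/leader
    Standing(int rank, Rider rider, String time, String gap) {
        this.rank = rank; this.rider = rider; this.time = time; this.gap = gap;
    }
    public int getRank() { return rank; }
    public Rider getRider() { return rider; }
}

class Stage {
    private int sequence;
    private float totalDistance;
    private final List<Standing> standings = new ArrayList<Standing>();
    public void setSequence(int sequence) { this.sequence = sequence; }
    public int getSequence() { return sequence; }
    public void setTotalDistance(float totalDistance) { this.totalDistance = totalDistance; }
    public float getTotalDistance() { return totalDistance; }
    public List<Standing> getStandings() { return standings; }
}

class Tour {
    private final int year;
    private final int numberOfStages;
    private final boolean prologue;
    private final List<Stage> stages = new ArrayList<Stage>();
    private final Map<Integer, Rider> riders = new HashMap<Integer, Rider>();
    Tour(int year, int numberOfStages, boolean prologue) {
        this.year = year; this.numberOfStages = numberOfStages; this.prologue = prologue;
    }
    public int getYear() { return year; }
    public int getNumberOfStages() { return numberOfStages; }
    public boolean isPrologue() { return prologue; }
    public List<Stage> getStages() { return stages; }
    public Map<Integer, Rider> getRiders() { return riders; }
}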

When I run the TourManager class – a class that creates a single Tour object for the Tour de France in 2010 –

public class TourManager {

    List<Tour> tours = new ArrayList<Tour>();

    public TourManager() {
        tours.add(new Tour(2010, 20, true));
        try {
            ProcessTourStandings.processTour(tours.get(0));
        } catch (MalformedURLException e) {
            System.out.println(e.getMessage());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Accessor for the list of tours (used by main below).
    public List<Tour> getTours() {
        return tours;
    }

    public static void main(String[] args) {
        TourManager tm = new TourManager();
        for (Tour tour : tm.getTours()) {
            for (Stage stage : tour.getStages()) {
                System.out.println("================ Stage " + stage.getSequence()
                                   + " (" + stage.getTotalDistance() + " km)");
                for (Standing standing : stage.getStandings()) {
                    if (standing.getRank() < 4) {
                        System.out.println(standing.getRank() + "." + standing.getRider().getName());
                    }
                }
            }
        }
    }
}

it will print the top 3 in every stage:

Image

Source:http://technology.amis.nl/2011/07/04/building-java-object-graph-with-tour-de-france-results-using-screen-scraping-java-util-parser-and-assorted-facilities/

Friday, 14 November 2014

Scraping Data: Site-specific Extractors vs. Generic Extractors

Scraping is becoming a rather mundane job, with every other organization getting its feet wet with it for its own data gathering needs. Plenty of crawlers have been built – some open-sourced, others internal to organizations as in-house utilities. Although crawling might seem like a simple technique at the outset, doing it at a large scale is the real deal. You need a distributed stack set up to handle huge volumes of data, to provide data in a low-latency model and to deal with fail-overs. This is still achievable after crossing the initial tech barrier and via continuous optimization. (P.S. Not under-estimating this part, because it still needs a team of engineers monitoring the stats and scratching their heads at times.)

Social Media Scraping

Focused crawls on a predefined list of sites

However, you bump into a completely new land if your goal is to generate clean and usable datasets from these crawls, i.e. to "extract" data in a format that your DB can process and that aids in generating insights. There are two ways of tackling this:

a. site-specific extractors, which give the desired results

b. generic extractors, which result in a few surprises

Assuming you still do focused crawls on a predefined list of sites, let’s go over specific scenarios when you have to pick between the two-

1. Mass-scale crawls; high-level metadata - Use generic extractors when you have a large-scale crawling requirement on a continuous basis. Large-scale would mean having to crawl sites in the range of hundreds of thousands. Since the web is a jungle and no two sites share the same template, it would be impossible to write an extractor for each. You therefore have to settle for just the document-level information from such crawls – the URL, meta keywords, blog or news titles, author, date and article content – which is still enough information to be happy with if your requirement is analyzing the sentiment of the data.
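
To make the contrast concrete, here is a rough sketch of such a document-level, generic extractor. It is not PromptCloud's actual stack; it simply uses the open-source jsoup library and a placeholder URL, and it only pulls template-independent fields (the crude "grab all visible text" step is exactly where the surprises discussed below come from).

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class GenericExtractor {
    public static void main(String[] args) throws Exception {
        String url = "http://example.com/some-article"; // placeholder URL

        Document doc = Jsoup.connect(url)
                            .userAgent("Mozilla/5.0 (research crawler)")
                            .timeout(10000)
                            .get();

        // Document-level fields that can be read off almost any page without
        // knowing anything about the site's template.
        String title    = doc.title();
        String keywords = doc.select("meta[name=keywords]").attr("content");
        String author   = doc.select("meta[name=author]").attr("content");
        String bodyText = doc.body().text(); // crude "article content": all visible text on the page

        System.out.println(url + "\t" + title + "\t" + keywords + "\t" + author
                           + "\t" + bodyText.length() + " chars of text");
    }
}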


A generic extractor case

Generic extractors don't yield accurate results and often mess up the datasets, rendering them unusable. The reason is that programmatically distinguishing relevant data from irrelevant data is a challenge. For example, how would the extractor know to skip pages that merely list blog posts and only extract the ones carrying a complete article? Delineating the article content from the title on a blog page is not easy either.

To summarize, below is what to expect of a generic extractor.

Pros-

minimal manual intervention

low on effort and time

can work on any scale

Cons-

Data quality compromised

inaccurate and incomplete datasets

lesser details suited only for high-level analyses

Suited for gathering- blogs, forums, news

Uses- Sentiment Analysis, Brand Monitoring, Competitor Analysis, Social Media Monitoring.

2. Low/mid-scale crawls; detailed datasets - If precise extraction is the mandate, there's no getting away from site-specific extractors. But realistically this is doable only if your scope of work is limited, i.e. a few hundred sites or fewer. Using site-specific extractors, you can extract as many fields as you need from any nook or corner of the web pages. Most of the time, most pages on a website share similar templates; if not, they can still be accommodated using site-specific extractors.


Designing extractor for each website

Pros-

High data quality

Better data coverage on the site

Cons-

High on effort and time

Site structures keep changing from time to time, and maintaining the extractors requires a lot of monitoring and manual intervention

Only for limited scale

Suited for gathering - any data from any domain on any site be it product specifications and price details, reviews, blogs, forums, directories, ticket inventories, etc.

Uses- Data Analytics for E-commerce, Business Intelligence, Market Research, Sentiment Analysis

Conclusion

Quite obviously you need both kinds of extractors handy to take care of various use cases. The only way generic extractors can work for detailed datasets is if everyone adopts standard data formats on the web (read our post on standard data formats here). However, given how widely internet use has spread among the masses and the variety of things folks like to do on the web, this is overly futuristic.

So while site-specific extractors are going to be around for quite some time, the challenge now is to tweak the generic ones to work better. At PromptCloud, we have added ML components to make them smarter and they have been working well for us so far.

What have your challenges been? Do drop in your comments.

Source: https://www.promptcloud.com/blog/scraping-data-site-specific-extractors-vs-generic-extractors/

Wednesday, 12 November 2014

'Scrapers' Dig Deep for Data on Web

At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its "Mood" discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.

It was a break-in. A new member of the site, using sophisticated software, was "scraping," or copying, every single message off PatientsLikeMe's private online forums.


Bilal Ahmed wrote about his health on a site that was scraped. Andrew Quilty for The Wall Street Journal.

PatientsLikeMe managed to block and identify the intruder: Nielsen Co., the privately held New York media-research firm. Nielsen monitors online "buzz" for clients, including major drug makers, which buy data gleaned from the Web to get insight from consumers about their products, Nielsen says.

"I felt totally violated," says Bilal Ahmed, a 33-year-old resident of Sydney, Australia, who used PatientsLikeMe to connect with other people suffering from depression. He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

After PatientsLikeMe told users about the break-in, Mr. Ahmed deleted all his posts, plus a list of drugs he uses. "It was very disturbing to know that your information is being sold," he says. Nielsen says it no longer scrapes sites requiring an individual account for access, unless it has permission.

The market for personal data about Internet users is booming, and in the vanguard is the practice of "scraping." Firms offer to harvest online conversations and collect personal details from social-networking sites, résumé sites and online forums where people might discuss their lives.

The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

The Wall Street Journal's examination of scraping—a trade that involves personal information as well as many other types of data—is part of the newspaper's investigation into the business of tracking people's activities online and selling details about their behavior and personal interests.

Some companies collect personal information for detailed background reports on individuals, such as email addresses, cell numbers, photographs and posts on social-network sites.

Others offer what are known as listening services, which monitor in real time hundreds or thousands of news sources, blogs and websites to see what people are saying about specific products or topics.

One such service is offered by Dow Jones & Co., publisher of the Journal. Dow Jones collects data from the Web—which may include personal information contained in news articles and blog postings—that help corporate clients monitor how they are portrayed. It says it doesn't gather information from password-protected parts of sites.


The competition for data is fierce. PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.

Nielsen spokesman Matt Anchin says the company's reports to its clients include publicly available information gleaned from the Internet, "so if someone decides to share personally identifiable information, it could be included."

Internet users often have little recourse if personally identifiable data is scraped: There is no national law requiring data companies to let people remove or change information about themselves, though some firms let users remove their profiles under certain circumstances.

California has a special protection for public officials, including politicians, sheriffs and district attorneys. It makes it easier for them to remove their home address and phone numbers from these databases, by filling out a special form stating they fear for their safety.

Data brokers long have scoured public records, such as real-estate transactions and courthouse documents, for information on individuals. Now, some are adding online information to people's profiles.

Many scrapers and data brokers argue that if information is available online, it is fair game, no matter how personal.

"Social networks are becoming the new public records," says Jim Adler, chief privacy officer of Intelius Inc., a leading paid people-search website. It offers services that include criminal background checks and "Date Check," which promises details about a prospective date for $14.95.

"This data is out there," Mr. Adler says. "If we don't bring it to the consumer's attention, someone else will."

Scraping for Your Real Name


New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people's real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou's people-search website offers records of about 250 million people, primarily in the U.S. and Canada.

PeekYou says it also is starting to work with listening services to help them learn more about the people whose conversations they are monitoring. It says it hands over only demographic information, not names or addresses.

Employers, too, are trying to figure out how to use such data to screen job candidates. It's tricky: Employers legally can't discriminate based on gender, race and other factors they may glean from social-media profiles.

One company that screens job applicants for employers, InfoCheckUSA LLC in Florida, began offering limited social-networking data—some of it scraped—to employers about a year ago. "It's slowly starting to grow," says Chris Dugger, national account manager. He says he's particularly interested in things like whether people are "talking about how they just ripped off their last employer."

Scrapers operate in a legal gray area. Internationally, anti-scraping laws vary. In the U.S., court rulings have been contradictory. "Scraping is ubiquitous, but questionable," says Eric Goldman, a law professor at Santa Clara University. "Everyone does it, but it's not totally clear that anyone is allowed to do it without permission."

Scrapers and listening companies say what they're doing is no different from what any person does when gathering information online—they just do it on a much larger scale.

"We take an incomprehensible amount of information and make it intelligent," says Chase McMichael, chief executive of InfiniGraph, a Palo Alto, Calif., "listening service" that helps companies understand the likes and dislikes of online customers.

Scraping services range from dirt cheap to custom-built. Some outfits, such as 80Legs.com in Texas, will scrape a million Web pages for $101. One Utah company, screen-scraper.com, offers do-it-yourself scraping software for free. The top listening services can charge hundreds of thousands of dollars to monitor and analyze Web discussions.

Some scrapers-for-hire don't ask clients many questions.

"If we don't think they're going to use it for illegal purposes—they often don't tell us what they're going to use it for—generally, we'll err on the side of doing it," says Todd Wilson, owner of screen-scraper.com, a 10-person firm in Provo, Utah, that operates out of a two-room office. It is one of at least three firms in a scenic area known locally as "Happy Valley" that specialize in scraping.


Some of the computer code behind screen-scraper.com's software. Chris Detrick for The Wall Street Journal

Screen-scraper charges between $1,500 and $10,000 for most jobs. The company says it's often hired to conduct "business intelligence," working for companies who want to scrape competitors' websites.

One recent assignment: A major insurance company wanted to scrape the names of agents working for competitors. Why? "We don't know," says Scott Wilson, the owner's brother and vice president of sales. Another job: attempting to scrape Facebook for a multi-level marketing company that wanted email addresses of users who "like" the firm's page—as well as their friends—so they all could be pitched products.

Scraping often is a cat-and-mouse game between websites, which try to protect their data, and the scrapers, who try to outfox their defenses. Scraping itself isn't difficult: Nearly any talented computer programmer can do it. But penetrating a site's defenses can be tough.

One defense familiar to most Internet users involves "captchas," the squiggly letters that many websites require people to type to prove they're human and not a scraping robot. Scrapers sometimes fight back with software that deciphers captchas.

Some professional scrapers stage blitzkrieg raids, mounting around a dozen simultaneous attacks on a website to grab as much data as quickly as possible without being detected or crashing the site they're targeting.

Raids like these are on the rise. "Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping," says Marino Zini, managing director of Sentor Anti Scraping System. The company's Stockholm team blocks scrapers on behalf of website clients.

At Monster.com, the jobs website that stores résumés for tens of millions of individuals, fighting scrapers is a full-time job, "every minute of every day of every week," says Patrick Manzo, global chief privacy officer of Monster Worldwide Inc. Facebook, with its trove of personal data on some 500 million users, says it takes legal and technical steps to deter scraping.

At PatientsLikeMe, there are forums where people discuss experiences with AIDS, supranuclear palsy, depression, organ transplants, post-traumatic stress disorder and self-mutilation. These are supposed to be viewable only by members who have agreed not to scrape, and not by intruders such as Nielsen.

"It was a bad legacy practice that we don't do anymore," says Dave Hudson, who in June took over as chief executive of the Nielsen unit that scraped PatientsLikeMe in May. "It's something that we decided is not acceptable, and we stopped."

Mr. Hudson wouldn't say how often the practice occurred, and wouldn't identify its client.

The Nielsen unit that did the scraping is now part of a joint venture with McKinsey & Co. called NM Incite. It traces its roots to a Cincinnati company called Intelliseek that was founded in 1997. One of its most successful early businesses was scraping message boards to find mentions of brand names for corporate clients.

In 2001, the venture-capital arm of the Central Intelligence Agency, In-Q-Tel Inc., was among a group of investors that put $8 million into the business.

Intelliseek struggled to set boundaries in the new business of monitoring individual conversations online, says Sundar Kadayam, Intelliseek's co-founder. The firm decided it wouldn't be ethical to use automated software to log into private message boards to scrape them.

But, he says, Intelliseek occasionally would ask employees to do that kind of scraping if clients requested it. "The human being can just sign in as who they are," he says. "They don't have to be deceitful."

In 2006, Nielsen bought Intelliseek, which had revenue of more than $10 million and had just become profitable, Mr. Kadayam says. He left one year after the acquisition.

At the time, Nielsen, which provides television ratings and other media services, was looking to diversify into digital businesses. Nielsen combined Intelliseek with a New York startup it had bought called BuzzMetrics.

The new unit, Nielsen BuzzMetrics, quickly became a leader in the field of social-media monitoring. It collects data from 130 million blogs, 8,000 message boards, Twitter and social networks. It sells services such as "ThreatTracker," which alerts a company if its brand is being discussed in a negative light. Clients include more than a dozen of the biggest pharmaceutical companies, according to the company's marketing material.

Like many websites, PatientsLikeMe has software that detects unusual activity. On May 7, that software sounded an alarm about the "Mood" forum.

David Williams, the chief marketing officer, quickly determined that the "member" who had triggered the alert actually was an automated program scraping the forum. He shut down the account.

The next morning, the holder of that account e-mailed customer support to ask why the login and password weren't working. By the afternoon, PatientsLikeMe had located three other suspect accounts and shut them down. The site's investigators traced all of the accounts to Nielsen BuzzMetrics.

On May 18, PatientsLikeMe sent a cease-and-desist letter to Nielsen. Ten days later, Nielsen sent a letter agreeing to stop scraping. Nielsen says it was unable to remove the scraped data from its database, but a company spokesman later said Nielsen had found a way to quarantine the PatientsLikeMe data to prevent it from being included in its reports for clients.

PatientsLikeMe's president, Ben Heywood, disclosed the break-in to the site's 70,000 members in a blog post. He also reminded users that PatientsLikeMe sells its data in an anonymized form, without attaching users' names to it. That sparked a lively debate on the site about the propriety of selling sensitive information. The company says most of the 350 responses to the blog post were supportive. But it says a total of 218 members quit.

In total, PatientsLikeMe estimates that the scraper obtained about 5% of the messages in the site's forums, primarily in "Mood" and "Multiple Sclerosis."

Source: http://online.wsj.com/articles/SB10001424052748703358504575544381288117888