Companies that use scanning tools to search for data must be careful to avoid violating privacy laws.
Increasingly, organizations are using automated tools to scan and collect information online. They’re looking at sites such as social networks and blogs for reasons such as reputation management, public relations, market research, and background checks.
Tools that automatically search for data, known as screen scrapers, are also becoming more advanced, but companies that use them must avoid legal pitfalls, which could include personal privacy violations as well as copyright infringement.
Social networking and other sites that collect user-generated data should also take steps to protect data on their sites, including establishing appropriate privacy policies and implementing the appropriate technical security measures.
The laws surrounding screen scraping and possible privacy and intellectual property violations are somewhat murky, said Brian Bowman, a partner at the law firm Pitblado. Bowman spoke at the Global Privacy Summit in Washington, D.C., sponsored by the International Association of Privacy Professionals.
In the United States, one interpretation of the law is that protected information doesn’t include information in a forum where a user voluntarily shared it, where it’s publicly available, and where users have not been led to believe that there are any technical controls limiting public access, he said. But it is fairly clear that it isn’t acceptable to collect information provided by children or from sites that are aimed at children. In other countries, such as Canada, the laws may be stricter regarding “expectations relating to publicly available information,” Bowman said.
The issue is growing in importance as tools to scrape screens for data are becoming more common and powerful, said Joanne Furtsch, policy and product architect at TRUSTe. Whereas much market data and research used to be collected by telephone, such data collection has been surpassed by online-based research, according to Furtsch.
Executives should also assess whether data being collected may be sensitive or personally identifiable information under state and national laws, said Bowman. Companies should consider applying filters that can remove names from data.
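As a rough illustration of such a filter, the Python sketch below replaces names with a placeholder before collected text is stored. It assumes the company already has a list of names to remove; the function name `redact_names` and the `[NAME]` placeholder are hypothetical, and finding names that are not on a list would require more capable tooling than is shown here.

```python
import re


def redact_names(text: str, names: list[str], placeholder: str = "[NAME]") -> str:
    """Replace each listed name in text with a placeholder (case-insensitive)."""
    if not names:
        return text
    # Try longer names first so a full name is matched before a shorter
    # name that happens to be contained in it.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(placeholder, text)


# Hypothetical scraped comment and name list, for illustration only.
sample = "Jane Doe praised the product; jane doe later returned it."
print(redact_names(sample, ["Jane Doe"]))
```

A list-based filter like this only catches names it has been told about, so it would be one layer of protection rather than a complete answer to the legal questions Bowman raises.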
Social networking sites and blogs should be sure to let consumers know, in their privacy policies and other areas, how the information they share on the site could be collected, said Furtsch. Such sites should also let potential screen scrapers know what information they’ll allow to be scraped. Some sites, such as Facebook, forbid any kind of automated data collecting, even if it’s by a user collecting data from his or her own account.
One technical measure that can protect against many scrapers is robots.txt, a text file that gives instructions to Web robots, said Furtsch. It has a serious limitation, however. In most cases, screen scrapers must choose to find the file and read its instructions in order for the text file to be effective; malicious bots likely won’t seek the file.
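To show what honoring that file looks like from the scraper's side, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The file contents, the `example.com` URLs, and the bot name `ExampleResearchBot` are made up for illustration; a real scraper would download the site's actual file rather than parse a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents a site might publish to keep
# all robots out of its user profile pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /profiles/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
# A well-behaved scraper runs this check before every request.
print(may_fetch(parser, "ExampleResearchBot", "https://example.com/profiles/alice"))
print(may_fetch(parser, "ExampleResearchBot", "https://example.com/about"))
```

Nothing forces a scraper to run this check, which is the limitation Furtsch describes: the file only works against bots that choose to consult it.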
Another measure sites should take is to provide their users with mechanisms for deleting their sensitive data whenever they choose.
Another widely used tool to protect against scrapers is the captcha. Captchas show squiggly letters and numbers that are designed to be difficult for a computer or bot to decipher. Sites have people type the captcha during registration to prove that they’re human.
Captchas should also be regularly updated, Furtsch said, as some scraping tools have been known to outsmart certain types of captchas.