Opened 5 years ago

Closed 4 years ago

Last modified 4 years ago

#692 closed New feature (fixed)

Add a anonymize IP addresses setting

Reported by: webzigartig
Owned by: vipsoft
Priority: normal
Milestone: Piwik 0.5.5
Component: Core
Keywords:
Cc:
Sensitive: no

Description

In Germany, Piwik would be a much better alternative to Google Analytics if there were an option to deactivate the storing of IP addresses. Maybe IP addresses could optionally be saved "hashed" instead of in "cleartext"?

Background: In Germany the legal situation is not clear. Storing IP addresses may not be allowed because it harms privacy. See: http://forum.piwik.org/index.php?showtopic=825.

Piwik would be the first software I know of with this "feature", so every German company could use it without the risk of running into problems with privacy law.

Attachments (1)

suggestion.diff (920 bytes) - added by joux 4 years ago.


Change History (26)

comment:1 follow-ups: Changed 5 years ago by vipsoft (robocoder)

  • Milestone set to Features requests - after Piwik 1.0
  • Summary changed from Option to deactivate storing of ip-addresses/to store ip-addresses anonymized to Plugin: Anonymize IP addresses
  • Feature request implies adding another column to store the hash value because location_ip is a bigint -- big enough to accommodate an ipv6 address, but not big enough for the output of popular hash algorithms, such as md5.
  • location_ip column probably should not be anonymized until it has been archived and is no longer needed.
  • Is table pruning a dependency?
  • Caveats may need to be attached to this feature as it could cripple other planned features on the product roadmap.

(Do German companies disable their firewall and web server log files too?)

comment:2 in reply to: ↑ 1 Changed 5 years ago by Blizzz

Replying to vipsoft:

(Do German companies disable their firewall and web server log files too?)

Well, some court decisions say the IP address is personal information (at least you can trace it to the ISP customer), some say it is not. And still we have some fragments of data privacy... However, I don't think that most companies switch off the logging of IP addresses, since it is needed for various reasons (e.g. intrusion detection). Also, most forums and similar software store the IP address too. Btw, any staff member who has access to private customer data has to sign a confidentiality agreement.

The problem with Google Analytics and any outsourced web analytics solution is that private data is given away to other companies. This may be okay in Germany. This could work in the EU. But it is risky when the data is transferred to other countries with weaker or quite different privacy rules. Thus, some institutions are of the opinion that the use of Google Analytics in Germany is illegal (http://www.internetworld.de/Nachrichten/News/Datenschuetzer-halten-Google-Analytics-fuer-rechtswidrig).

All told, the legal situation is vague and unclear. I'd say as long as you don't give the data away and store it safely, you have nothing to fear.

Admittedly, I am not a lawyer.

comment:3 in reply to: ↑ 1 Changed 5 years ago by webzigartig

(Do German companies disable their firewall and web server log files too?)

I think most of them don't do this. But it would be more secure for a company to do this, if the company wants to be sure not to violate privacy.

comment:4 Changed 5 years ago by joux

If we store a hash of the IP address, I think the entry can still be related back to the IP. If an authority seized a server running Piwik, they would still be able to prove that a certain person has accessed it by just calculating the hash of the suspect's IP and comparing it to the database. So privacy does not just mean protection from the owner of the server.
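To illustrate that point, here is a minimal sketch. It assumes the stored value is an unsalted MD5 of the dotted IP string, which is only one possible scheme; `$storedHashes` and `wasSeen()` are hypothetical names, not Piwik code.

```php
<?php
// Hypothetical: hashes as they might sit in a visit log (unsalted MD5 of the IP string).
$storedHashes = [md5('10.11.12.13'), md5('192.0.2.7')];

// Anyone who knows a candidate IP can recompute its hash and look it up.
function wasSeen(string $suspectIp, array $storedHashes): bool
{
    return in_array(md5($suspectIp), $storedHashes, true);
}

var_dump(wasSeen('10.11.12.13', $storedHashes)); // bool(true)
var_dump(wasSeen('10.11.12.14', $storedHashes)); // bool(false)
```

Because the IPv4 space is small, the same lookup could in principle be repeated for every possible address, so a full hash hides little.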

My suggestion:
Just store a truncated hash of the IP.

Code example:

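$ip = '10.11.12.13';
$ipAsLong = ip2long($ip);
// hexdec() of a 128-bit MD5 returns a float, so only the leading digits are exact
$ipAsHash = hexdec(md5($ipAsLong));
// keep the first 9 decimal digits so the value fits into the IPv4 range
$anonIpAsLong = (int) substr(number_format($ipAsHash, 0, ',', ''), 0, 9);
$backToIp = long2ip($anonIpAsLong); // Result: 25.68.34.54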

This way, Piwik would still have an IP to work with. But it cannot be related back to a real IP, because the hash is not complete. Yes, hash collisions will probably occur. But as the IP address is only one part of the user identification, we will not miss a visitor very often. Plus, there's no need to change the table to hold the large full hash.

I really don't know anything about Piwik's plugin system, but in #312 it looks pretty easy to provide this kind of functionality.

Is this stupid? Will this break any other features?

comment:5 Changed 5 years ago by joux

Looking at the code, I'm afraid it is not possible to catch all access to the IP address with plugin hooks. So the change should probably be done in the core, which seems to have been turned down already on #312. Still, this could be a quick fix for paranoid people:

In core/Common.php, change the getIp() function like this

static public function getIp(){
return sprintf("%u", (int)substr(number_format(hexdec(md5(ip2long(self::getIpString()))),0,',',''),0,9) );
}

comment:6 Changed 5 years ago by vipsoft (robocoder)

joux: it should be possible to catch all accesses (now that #825 [1344] has been committed to SVN).

As for a truncated hash... In the case where an "authority" seizes a server, then they inherently have the authority to inspect more than just the database, right? (e.g., server logs) Hashing would seem to strike a balance between user privacy and cooperating with law enforcement.

That said... a requirement for this plugin should be to implement a framework for anonymizing IP addresses. Site operators can then customize/extend the implementation to suit their needs, since anonymity and functionality are (roughly) inversely proportional to each other.

comment:7 Changed 5 years ago by joux

As far as I've seen, the IP from the database is already used in recognizeTheVisitor() to decide whether the user is known, so at that point neither Tracker.newVisitorInformation nor Tracker.knownVisitorInformation has been called yet. To keep the (anonymized) IP usable for recognizing a user, the IP being matched must be hashed and truncated at that point, too.

Maybe yet another hook in getUserSettingsInformation() would allow for a generic filtering possibility before the data is used/saved anywhere. But that's just a quick guess.

comment:8 Changed 5 years ago by domtop

comment:9 Changed 4 years ago by Blackhole

I would like to support the feature request and maybe add some hints for easier implementation:

  1. In Germany, storing IP addresses is not allowed. I just read about an order of the Berlin data protection commissioner prohibiting a blogger from storing IP addresses. Thus, using Piwik poses the risk of a regulatory offense in Germany.
  2. If the only problem with using a hash of the IP address is the limitation to a BigInt, then you could just make it fit (similar to what joux proposed):
$ip_hash = md5($ip) MOD 2^64; // make $ip_hash fit into a BigInt which is of 8 bytes size.

Can anyone tell me, where in the core I would have to change this? I want to continue using Piwik but want to comply with German law as well...
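The "MOD 2^64" line above is pseudocode. A minimal PHP sketch of the same idea follows; `ipHash64()` is a hypothetical helper, and it assumes the reduced hash is kept as a hex string until it reaches the database.

```php
<?php
// md5($ip) mod 2^64 is simply the low 64 bits, i.e. the last 16 of the 32 hex digits.
function ipHash64(string $ip): string
{
    return substr(md5($ip), -16);
}

$hash = ipHash64('10.11.12.13');
echo strlen($hash), "\n"; // 16 (hex digits, i.e. 8 bytes)
```

The value is left as hex on purpose: PHP integers are signed 64-bit, so hexdec() would return an imprecise float for any value above PHP_INT_MAX.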

comment:10 Changed 4 years ago by joux

The discussion about the use of Google Analytics in Germany is being continued. There will probably be no legal restrictions against it, but a workgroup of privacy boards seems to be working on a list of recommendations for website owners, that excludes the use of GA. (German source: http://www.zeit.de/digital/datenschutz/2009-11/google-analytics-datenschutz?page=all)

This means that the interest in a self-hosted statistics tool like Piwik could rise soon, if it complies with the recommendations (does not permanently store IP addresses).

Could we move this feature request to an earlier milestone, like 0.6?

comment:11 Changed 4 years ago by matt (mattab)

  • Milestone changed from Features requests - after Piwik 1.0 to 2 - Piwik 0.6 - DigitalVibes
  • Sensitive unset
  • Summary changed from Plugin: Anonymize IP addresses to Add a anonymize IP addresses setting

There is a lot of interest from German users in making this happen. If Piwik can be the solution for German websites, this would greatly help the German community and would help Piwik. However, I would recommend doing this in core with an enable/disable setting rather than in a plugin.

Changed 4 years ago by joux

comment:12 Changed 4 years ago by joux

Thank you for changing the milestone.

I made a quick attempt at turning my suggestion above into a patch. Please treat it as a suggestion only, as I'm not a programmer.

comment:13 follow-up: Changed 4 years ago by vipsoft (robocoder)

The solution is non-trivial.

  • Piwik_Common::getIp() is a general utility function, not exclusively for the Tracker.
  • If cookies are blocked, truncating the last octet may result in incorrect identification of unique visitors.
  • Truncating the last octet may be insufficient given that GA does this already and is still being criticized.
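The unique-visitor concern can be seen in a small sketch. `maskLastOctet()` is a hypothetical helper (not Piwik's getIp()), and the example assumes IPv4 and a 64-bit PHP build.

```php
<?php
// Zero the last octet of an IPv4 address (keeps the /24 network part).
function maskLastOctet(string $ip): string
{
    return long2ip(ip2long($ip) & 0xFFFFFF00);
}

// Two different hosts on the same /24 become indistinguishable by IP alone.
echo maskLastOctet('192.0.2.17'), "\n";  // 192.0.2.0
echo maskLastOctet('192.0.2.200'), "\n"; // 192.0.2.0
```

Without a cookie to tell them apart, both visitors would have to be distinguished by their other settings alone.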

comment:14 Changed 4 years ago by jimbo

Google Analytics is being criticized mostly because the servers are outside the EU and Google will cooperate with legal authorities worldwide, not all of which comply with the laws and regulations of the EU. Additionally, Google gives no guarantee that it will not use your data for its own purposes. Here Piwik is already leading the discussion :-)

But: Storing the IP address is not recommended, no matter whether we are talking about Google or any other web tracking, whether it is based in Germany or anywhere else (like eTracker or WebTrekk). According to a resolution (see below), it is not even allowed to process IP addresses, let alone store them. This results in

  • IP addresses not being processed and being shortened before being stored
  • no domain addresses being processed or stored or shown
  • no geoIP
  • no analysis of ISP data (name, kind of access like DSL or dial-up)
  • no domains

The resolution, issued by the "obersten Aufsichtsbehörden für den Datenschutz im nicht-öffentlichen Bereich", can be found here at http://www.lfd.m-v.de/dschutz/beschlue/Analyse.pdf.

German web tracking companies like eTracker already offer their customers a setting that conforms to this paper.

Since I am not a lawyer, I have no idea how binding this resolution is, whether you still have a choice, or what will happen if you don't adhere to it.

But I think it would be a good idea to give admins the choice, as some companies do, and I would prefer to be able to do this in the core settings (no plugin).

comment:15 in reply to: ↑ 13 Changed 4 years ago by tillsc

Couldn't this be done in the core quite easily by modifying just one critical line of code? I'm not sure whether I reference the correct line in my examples, but there must be a line of code in Piwik that accesses the IP address for the first time. So I'll continue without knowing whether I'm talking about the right code. Please read the following as an informal proposal that assumes this is the correct line :-) .

1) The 100% solution would be to throw the IP away completely the first time Piwik sees it. According to some blog posts on the web this happens in /core/Tracker/Visit.php (this is pure theory, I absolutely have no clue ;-) ):

'location_ip' => $userInfo['location_ip'],

=>

'location_ip' => 0,

So there would be an empty location_ip field in the database. As far as I have read, Piwik would still work except for a few plugins.
I think it would be very cool to have an option "Throw IP away" (for hardline Germans :-) of which there would be quite a few, I guess).

2) A middle-ground approach could be the use of hashes. I don't think it's necessary to store the complete output of MD5 to avoid all collisions (mathematically there can also be collisions in the 16-byte MD5 output of PHP). So I propose to accept some risk of collisions and to store the hash output in location_ip:

'location_ip' => hexdec(substr(md5($userInfo['location_ip']), -8)),

The code is completely untested and shall only demonstrate the idea! Taking the last 4 bytes or the first ones shouldn't make a difference: On modification of one bit in the input of a valid hash function _every_ bit in the output flips with the same probability of 50%.
This approach could be expressed in an option "Use 4 byte hash".

3) I would take one further step and use 1 of the 4 bytes of the hashed IP to mark that this is no valid IP address. E.g. this could be achieved by using the 0.X.X.X/8 subnet:

'location_ip' => hexdec(substr(md5($userInfo['location_ip']), -6)),

This approach ("Use 3 byte hash") would result in more than 16 million possible hash values (2^24 = 16,777,216), which should be enough for most use cases, I guess.
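The 3-byte variant can be sketched as below. This is an untested-in-Piwik illustration of the idea only; `threeByteHash()` is a hypothetical helper name.

```php
<?php
// Keep only the last 3 bytes (6 hex digits) of the MD5, as proposed above.
function threeByteHash(string $ip): int
{
    return hexdec(substr(md5($ip), -6));
}

$value = threeByteHash('10.11.12.13');
// The value is below 2^24, so read as an IPv4 address its first octet is always 0.
echo long2ip($value), "\n";
```

Any stored value of the form 0.X.X.X is therefore recognizable as a hash rather than a routable address.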

I really appreciate your work and I think this would be a great feature for many, many German sites, especially in the governmental area. Your approach to logging already matches the requirements of German privacy laws very well. The IP-killing feature would perfect it in a way that no other approach could guarantee so easily.

comment:17 Changed 4 years ago by martin_LE

I really don't see the point of saving the hash.
I wrote a plugin ( #1168 ) that trims the IP right before the data is saved to the database. The hashing over configuration and IP (to recognize users in "recognizeTheVisitor") is done before that, so there is no conflict. I've written a post to explain my arguments (English and German).

comment:18 Changed 4 years ago by vipsoft (robocoder)

  • Milestone changed from 2 - Piwik 0.6 - DigitalVibes to 1 - Piwik 0.5.5
  • Owner set to vipsoft

comment:19 Changed 4 years ago by vipsoft (robocoder)

  • Resolution set to fixed
  • Status changed from new to closed

(In [1877]) fixes #692 - plugin (deactivated by default) to anonymize visitor IP addresses; the number of octets to mask is configurable; let me know if I've missed any edge cases in the unit tests

comment:20 Changed 4 years ago by martin_LE

Great, thanks. Works as expected.

comment:21 Changed 4 years ago by webzigartig

Great, thank you!

comment:22 Changed 4 years ago by matt (mattab)

Great, Anthon! This will make the Germans very happy.

comment:23 Changed 4 years ago by jimbo

Thank you, but where do I configure the number of octets?

comment:24 Changed 4 years ago by jimbo

The Live Visitor-Plugin still shows IPs :-(

comment:25 Changed 4 years ago by martin_LE

@jimbo
Are you sure that you are working with Piwik 0.5.5? The IPs shouldn't be listed in the Live Visitor-Plugin anymore.

You can define the number of octets in your config.ini.php:
[Tracker]
ip_address_mask_length = 2
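
As a rough illustration of what a mask length of 2 might do: the sketch below assumes the setting zeroes that many trailing octets of an IPv4 address. The actual plugin code from [1877] is not shown in this ticket, and `maskIp()` is a hypothetical helper, not the plugin's API.

```php
<?php
// Hypothetical helper: zero the last $maskLength octets of an IPv4 address.
function maskIp(string $ip, int $maskLength): string
{
    $maskLength = max(0, min(4, $maskLength));
    // Build a 32-bit mask with the low 8 * $maskLength bits cleared.
    $mask = (0xFFFFFFFF << (8 * $maskLength)) & 0xFFFFFFFF;
    return long2ip(ip2long($ip) & $mask);
}

echo maskIp('203.0.113.42', 2), "\n"; // 203.0.0.0
echo maskIp('203.0.113.42', 1), "\n"; // 203.0.113.0
```

Under that assumption, a larger mask length trades visitor-level precision for anonymity, which matches the trade-off discussed earlier in this ticket.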
