Add a anonymize IP addresses setting #692

Closed
anonymous-matomo-user opened this issue May 5, 2009 · 24 comments
Labels: Enhancement (new feature suggestions that enhance Matomo's capabilities or add a new report, new API, etc.)

@anonymous-matomo-user

In Germany, Piwik would be a much better alternative to Google Analytics if there were an option to deactivate the storing of IP addresses. Maybe IP addresses could optionally be saved "hashed" instead of in "cleartext"?

Background: In Germany the situation is not clear. Storing IP addresses may not be allowed because it harms privacy. See: [http://forum.piwik.org/index.php?showtopic=825].

Piwik would be the first software I know of with this "feature", so every German company could use it, probably without running into problems with privacy law.

@robocoder
Contributor

  • Feature request implies adding another column to store the hash value because location_ip is a bigint -- big enough to accommodate an ipv6 address, but not big enough for the output of popular hash algorithms, such as md5.
  • location_ip column probably should not be anonymized until it has been archived and is no longer needed.
  • Is table pruning a dependency?
  • Caveats may need to be attached to this feature as it could cripple other planned features on the product roadmap.
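The size mismatch behind the first bullet can be checked directly. This is only a quick sketch; the sample IP is arbitrary, and the 8-byte figure is the usual storage size of a BIGINT column.

```php
<?php
// MD5 yields 128 bits; a BIGINT column (like a 64-bit PHP integer) holds 64.
$rawDigest = md5('10.11.12.13', true);   // raw binary output
$hexDigest = md5('10.11.12.13');         // hexadecimal output

echo strlen($rawDigest), " bytes of raw MD5 output\n";   // 16
echo strlen($hexDigest), " hex digits\n";                // 32
echo 8, " bytes available in a BIGINT column\n";
```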

(Do German companies disable their firewall and web server log files too?)

@anonymous-matomo-user
Author

Replying to vipsoft:

(Do German companies disable their firewall and web server log files too?)

Well, some court decisions say the IP address is personal information (at least you can track it down to the ISP customer), some say it is not. And still we have some fragments of data privacy... However, I don't think that most companies switch off the logging of IP addresses, since it is needed for various reasons (e.g., intrusion detection). Also, most forums and similar software store the IP address too. Btw, any staff member who has insight into private data of customers has to sign a confidentiality undertaking.

The problem with Google Analytics and any outsourced web analytics solution is that private data is given away to other companies. This may be okay in Germany. This could work in the EU. But it is risky when the data is transferred to other countries with a weaker or quite different privacy policy. Thus, some institutions are of the opinion that the usage of Google Analytics in Germany is illegal (http://www.internetworld.de/Nachrichten/News/Datenschuetzer-halten-Google-Analytics-fuer-rechtswidrig).

All told, the legal situation is vague and unclear. I'd say as long as you don't give the data away and have it stored safely, you have nothing to fear.

Admittedly, I am not a lawyer.

@anonymous-matomo-user
Author

(Do German companies disable their firewall and web server log files too?)
I think most of them don't do this. But it would be more secure for a company to do this, if the company wants to be sure not to violate privacy.

@anonymous-matomo-user
Author

If we store a hash of the IP address, I think the entry can still be related back to the IP. If an authority seized a server running Piwik, they would still be able to prove that a certain person has accessed it by just calculating the hash of the suspect's IP and comparing it to the database. So privacy does not just mean protection from the owner of the server.
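The check described above takes only a few lines. A minimal sketch, assuming the column holds an unsalted MD5 of the dotted IP string; `hashMatchesIp` is a hypothetical helper, not part of Piwik.

```php
<?php
// Hypothetical helper: does a stored, unsalted MD5 hash belong to a given IP?
function hashMatchesIp(string $storedHash, string $suspectIp): bool
{
    return hash_equals($storedHash, md5($suspectIp));
}

$storedHash = md5('10.11.12.13');   // what a hash-only column would contain

var_dump(hashMatchesIp($storedHash, '10.11.12.13'));   // bool(true)
var_dump(hashMatchesIp($storedHash, '10.11.12.14'));   // bool(false)
```

Because the IPv4 space has only about 4.3 billion addresses, the same comparison could in principle be run over every address, which is why a full hash offers little protection.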

My suggestion:
Just store a truncated hash of the IP.

Code example:

$ip='10.11.12.13';
$ipAsLong=ip2long($ip);
$ipAsHash=hexdec(md5($ipAsLong));
$anonIpAsLong=substr(number_format($ipAsHash,0,',',''),0,9);
$backToIp=long2ip($anonIpAsLong); // Result: 25.68.34.54

This way, Piwik would still have an IP to work with. But it cannot be related back to a real IP, because the hash is not complete. Yes, hash collisions will probably occur. But as the IP address is only one part of the user identification, it will not happen very often that we miss a visitor. Plus there's no need to change the table to hold the large full hash.

I really don't know anything about Piwik's plugin system, but in #312 it looks pretty easy to provide this kind of functionality.

Is this stupid? Will this break any other features?

@anonymous-matomo-user
Author

Looking at the code, I'm afraid it is not possible to catch all access to the IP address with plugin hooks. So the change should probably be done in the core, which seems to have been turned down already on #312. Still, this could be a quick fix for paranoid people:

In core/Common.php, change the getIp() function like this

static public function getIp(){
return sprintf("%u", (int)substr(number_format(hexdec(md5(ip2long(self::getIpString()))),0,',',''),0,9) );
}

@robocoder
Contributor

joux: it should be possible to catch all accesses (now that #825 [1344] has been committed to SVN).

As for a truncated hash... In the case where an "authority" seizes a server, then they inherently have the authority to inspect more than just the database, right? (e.g., server logs) Hashing would seem to strike a balance between user privacy and cooperating with law enforcement.

That said... a requirement for this plugin should be to implement a framework for anonymizing IP addresses. Site operators can then customize/extend the implementation to suit their needs, since anonymity and functionality are (roughly) inversely proportional to each other.

@anonymous-matomo-user
Author

As far as I've seen, the IP from the database is already used in recognizeTheVisitor(), in order to decide whether the user is known. So neither Tracker.newVisitorInformation nor Tracker.knownVisitorInformation have been called so far. In order to keep the (anonymized) IP as a possibility to recognize a user, the matched IP must be hashed+truncated at that point, too.

Maybe yet another hook in getUserSettingsInformation() would allow for a generic filtering possibility before the data is used/saved anywhere. But that's just a quick guess.

@anonymous-matomo-user
Author

I would like to support the feature request and maybe add some hint for easier implementation:

  1. In Germany, storing of IP addresses is not allowed. I just read of an order of the Berlin data protection commissioner prohibiting a blogger from storing IP addresses. Thus, using Piwik poses the risk of a regulatory offense in Germany.
  2. If the only problem using a hash of the IP-address is the limitation to a BigInt, then you could just make it fit (similar to what joux proposed):
$ip_hash = md5($ip) MOD 2^64; // make $ip_hash fit into a BigInt which is of 8 bytes size.

Can anyone tell me, where in the core I would have to change this? I want to continue using Piwik but want to comply with German law as well...
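The "MOD 2^64" line above is pseudocode. A hedged sketch of the same idea in PHP follows: reading the MD5 hex digest as one 128-bit number, its value modulo 2^64 is simply the last 16 hex digits. `ipHash64` is a hypothetical name. The sketch returns a hex string because `hexdec()` on 16 hex digits can exceed `PHP_INT_MAX` and silently fall back to a float.

```php
<?php
// Hypothetical helper: keep only the low 64 bits of the MD5 digest,
// i.e. "md5($ip) mod 2^64", returned as 16 hex digits.
function ipHash64(string $ip): string
{
    return substr(md5($ip), -16);
}

$hash = ipHash64('10.11.12.13');
echo $hash, ' (', strlen($hash) * 4, " bits)\n";
```

Note that this only makes the hash fit the column; it can still be matched against a known IP in the same way as a full hash.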

@anonymous-matomo-user
Author

The discussion about the use of Google Analytics in Germany is being continued. There will probably be no legal restrictions against it, but a workgroup of privacy boards seems to be working on a list of recommendations for website owners, that excludes the use of GA. (German source: http://www.zeit.de/digital/datenschutz/2009-11/google-analytics-datenschutz?page=all)

This means that the interest in a self-hosted statistics tool like Piwik could rise soon, if it complies with the recommendations (does not permanently store IP addresses).

Could we move this feature request to an earlier milestone, like 0.6?

@mattab
Member

mattab commented Dec 8, 2009

There is a lot of interest from German users in making this happen. If Piwik can be the solution for German websites, this would greatly help the German community and would help Piwik. However, I would recommend doing this in core with an enable/disable setting rather than in a plugin.

@anonymous-matomo-user
Author

Attachment:
suggestion.diff

@anonymous-matomo-user
Author

Thank you for changing the milestone.

I made a quick attempt at turning my suggestion above into a patch. Please treat it as a suggestion only, as I'm not a programmer.

@robocoder
Contributor

The solution is non-trivial.

  • Piwik_Common::getIp() is a general utility function, not exclusively for the Tracker.
  • If cookies are blocked, truncating the last octet may result in incorrect identification of unique visitors.
  • Truncating the last octet may be insufficient given that GA does this already and is still being criticized.

@anonymous-matomo-user
Author

Google Analytics is being criticized mostly because the servers are outside the EU and Google will cooperate with legal authorities worldwide, not all of which comply with the laws and regulations of the EU. Additionally, Google gives no guarantee that it will not use your data for its own purposes. Here Piwik is already leading the discussion :-)

But: Storing the IP address is not recommended, no matter whether we are talking about Google or any other web tracking service, and whether it is based in Germany or anywhere else (like eTracker or WebTrekk). According to a resolution (see below), it is not even allowed to process IP addresses, let alone store them. This results in

  • IP addresses not being processed and being shortened before being stored
  • no domain addresses being processed or stored or shown
  • no geoIP
  • no analysis of ISP data (name, kind of access like DSL or dial-up)
  • no domains

The resolution, issued by the "obersten Aufsichtsbehörden für den Datenschutz im nicht-öffentlichen Bereich" (the supreme supervisory authorities for data protection in the non-public sector), can be found at [http://www.lfd.m-v.de/dschutz/beschlue/Analyse.pdf].

German web tracking companies like eTracker already offer their customers a setting that conforms to this paper.

Since I am not a lawyer, I have no idea how binding this resolution is, whether you still have a choice, or what will happen if you don't adhere to it.

But I think it would be a good idea to give admins the choice, like some companies do, and I would prefer to be able to do this in the core settings (no plugin).

@anonymous-matomo-user
Author

Couldn't this be done in the core quite easily by modifying just one critical line of code? I'm not sure if I reference the correct line in my examples, but there must be a line of code in Piwik which accesses the IP address the first time. So I'll continue without knowing whether I'm talking about the right code example. Please read the following just as an informal proposal guessing this would be the correct line in the code :-) .

  1. The 100% solution would be to throw the IP away completely the first time Piwik sees it. According to some blog posts on the web, this happens in /core/Tracker/Visit.php (this is pure theory, I have absolutely no clue ;-) ):
'location_ip' => $userInfo['location_ip'],

=>

'location_ip' => 0,

So there would be an empty location_ip field in the database. As far as I have read, Piwik would still work, except for a few plugins.
I think it would be very cool to have an option "Throw IP away" (for hardliner Germans :-) of which there would be quite a few, I guess).

  2. A medium approach could be the use of hashes. I don't think it's necessary to store the complete output of MD5 to avoid all collisions (mathematically, there can also be collisions in the 16-byte md5 output of PHP). So I propose to accept some risk of collisions and to store the hash output in location_ip:
'location_ip' => hexdec(substr(md5($userInfo['location_ip']), -8)),

The code is completely untested and is only meant to demonstrate the idea! Taking the last 4 bytes or the first ones shouldn't make a difference: when one bit in the input of a good hash function is modified, every bit in the output flips with the same probability of 50%.
This approach could be expressed in an option "Use 4 byte hash".

  3. I would take one further step and use 1 of the 4 bytes of the hashed IP to mark that this is no valid IP address. E.g. this could be achieved by using the 0.X.X.X/8 subnet:
'location_ip' => hexdec(substr(md5($userInfo['location_ip']), -6)),

This approach ("Use 3 byte hash") would result in more than 16 million possible hash values, which should be enough for most use cases, I guess.

I really appreciate your work and I think this would be a great feature for many, many German sites, especially in the governmental area. Your approach to logging already matches the requirements of German privacy laws very well. The IP killing feature would perfect it in a way no other approach could guarantee as easily.
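To make the difference between the 4-byte and 3-byte variants visible, here is a small sketch. It assumes 64-bit PHP and hashes the dotted IP string, whereas the snippets above hash whatever `$userInfo['location_ip']` contains. `pseudoIp` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: turn the last $bytes bytes of md5($ip) into a pseudo IP.
// Assumes 64-bit PHP, where hexdec() of up to 8 hex digits returns an int.
function pseudoIp(string $ip, int $bytes): string
{
    $hex = substr(md5($ip), -2 * $bytes);   // last N bytes of the digest
    return long2ip((int) hexdec($hex));
}

echo pseudoIp('10.11.12.13', 4), "\n";   // can land anywhere in the IPv4 range
echo pseudoIp('10.11.12.13', 3), "\n";   // always inside 0.X.X.X
```

With 3 bytes the value stays below 2^24, so the first octet is always 0 and the stored value can be recognized as a non-real address.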

@anonymous-matomo-user
Author

I really don't see the point of saving the hash.
I wrote a plugin ( #1168 ) that trims the IP right before the data is saved to the database. The hashing over configuration and IP (to recognize users in "recognizeTheVisitor") is done before that, so there is no conflict. I've written a post to explain my arguments (English & German).
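The ordering described here can be sketched roughly as follows. Every name in it is hypothetical and it is not Piwik's actual code; it also assumes that "trimming" means zeroing the last octet of an IPv4 address on 64-bit PHP.

```php
<?php
// Hypothetical names throughout; this is not Piwik's actual code.
function visitorFingerprint(string $ip, string $userAgent): string
{
    return md5($ip . '|' . $userAgent);   // uses the full IP, in memory only
}

function trimIp(string $ip): string
{
    return long2ip(ip2long($ip) & 0xFFFFFF00);   // zero the last octet
}

$ip  = '192.168.1.77';
$row = [
    'config_hash' => visitorFingerprint($ip, 'ExampleBrowser/1.0'),   // step 1
    'location_ip' => trimIp($ip),                                     // step 2
];
print_r($row);
```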

@robocoder
Contributor

(In [1877]) fixes #692 - plugin (deactivated by default) to anonymize visitor IP addresses; the number of octets to mask is configurable; let me know if I've missed any edge cases in the unit tests

@anonymous-matomo-user
Author

Great, thanks. Works as expected.

@anonymous-matomo-user
Author

Great, thank you!

@mattab
Member

mattab commented Mar 5, 2010

Great, Anthon! This will make the Germans very happy.

@anonymous-matomo-user
Author

Thank you, but where do I configure the number of octets?

@anonymous-matomo-user
Author

The Live Visitor-Plugin still shows IPs :-(

@anonymous-matomo-user
Author

@jimbo
Are you sure that you are working with Piwik 0.5.5? The IPs shouldn't be listed in the Live Visitor-Plugin anymore.

You can define the number of octets in your config.ini.php:
[Tracker]
ip_address_mask_length = 2
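For readers wondering what such a setting does to an address, here is a minimal sketch of octet masking. It assumes IPv4, 64-bit PHP, and that the setting counts octets zeroed from the right. `maskIpv4` is a hypothetical helper, not the plugin's real function.

```php
<?php
// Hypothetical helper; assumes IPv4 and 64-bit PHP.
function maskIpv4(string $ip, int $octetsToMask): string
{
    $long = ip2long($ip);
    if ($long === false) {
        throw new InvalidArgumentException("Not an IPv4 address: $ip");
    }
    $octetsToMask = max(0, min(4, $octetsToMask));              // clamp to 0..4
    $mask = (0xFFFFFFFF << (8 * $octetsToMask)) & 0xFFFFFFFF;   // zero the low octets
    return long2ip($long & $mask);
}

echo maskIpv4('192.168.1.77', 1), "\n";   // 192.168.1.0
echo maskIpv4('192.168.1.77', 2), "\n";   // 192.168.0.0
```

The more octets are masked, the more visitors share one stored address, which is the privacy versus accuracy trade-off discussed earlier in the thread.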

This issue was closed.