Database Schema documentation

This document aims to provide an introduction to the DB schema of Piwik, explaining the different tables and architecture choices, and linking to the related source files when necessary

The schema

source:/trunk/misc/db-schema.png

The image was generated using DBDesigner and the XML source schema can be found on the piwik repository

Introduction

The database has been designed with emphasis on simplicity, efficiency and data modularity.

The database contains different sections

  • Statistics logger
  • Users & Permissions
  • Site
  • Archived data
  • Debug / Info log
  • SQL query profiling

Database sections

Statistics logger

This part contains the information about the visitors logged by the tracker. Some information is obtained by the Javascript tag, some stored in the cookie, some by PHP. Each unique visitor is assigned a unique visitor_idcookie that is saved in a cookie. We then create a line in log_visit for each visit. For example, if the visitor visits the website twice in the day with more than 30 minutes in between the two visits, there will be two rows in the table log_visit for this visitor.

During a visit, a visitor hits at least one page, called "action". An action is being defined by a name ("homepage", "/blog/hello-world") and a type (an integer that defines page / download / etc). See the class Piwik_LogStats_Action for more information.

All unique actions are saved in the table log_action.

A new action by a visitor creates a record in log_link_visit_action containing the idaction and the idvisit. Also, this table contains a field idaction_ref and time_spent_ref_action which are refering to the previous action the visitor did. For example when hitting the second page, the algorithm will read from the cookie the idaction of the first page, then save this as the action_ref, and save the time spent during the first action as the time_spent_ref_action. This technique is used because we don't have to update a previous record: we simply save the information about the referer page in the current record.

Related classes: Piwik_Visit, Piwik_LogStats and the files located in modules/LogStats. The tracker entry point called by the javascript tag is the file piwik.php.

Users & Permissions

A user is defined by a login, password, email. A token_auth is generated and is used to sign API calls.

A user has an access level ('view' or 'admin' or 'no access' or the user is 'super admin' and has admin access by default) on a given idsite.

Related classes: Piwik_Access, Piwik_Login

Site

A website is defined by an idsite, a main_url and is linked to site_url so it can have several alias url.

Related classes: Piwik_Site, and the API for websites Piwik_SitesManager_API

Archived data

An archive in piwik is the aggregate of data for a given period. It's the result of the logs (See the section Statistics Logger) being processed into meaningful data.

A row in this archive_* table contains some data for a given date / period on a website. For example a record could contain the list of countries on the website idsite = 3 on the week of the January, 7th 2008.

There are two different tables because we have two data types possible in piwik archives:

  • float
  • blob

The table archive_numeric_* is used to store plain numbers. The value field has a FLOAT type which means you can save integers and floats. For example it is used to record the number of visitors over a given period, the number of distinct search engines keywords used, etc.

The table archive_blob_* stores anything that is not a number. A BLOB is a binary data type that can contain anything from strings, compressed strings to serialized arrays, serialized objects, etc. For example it is used to store the search engine keywords that the visitors used over a given period, the visitors' browsers, etc.

Both table have exactly the same structure except the type of the value field (BLOB in one case, FLOAT in the other). The structure has the following fields:

  • idarchive defines a unique archive. All data for a specific website over a specific period (day/week/etc.) for a specific date will have the same idarchive. In other words this idarchive is the same as if (idsite,period,date1,date2) was the primary key.
  • name is the description of the value of the record. For example if you store the number of distinct keywords used a pertinent name could be 'Referers_distinctKeywords'
  • idsite is the website for which the record refers to
  • date1 and date2 are the starting and ending dates for which the record refers to. If the archive refers to a sigle day, date1 = date2. The class handling the date logic is Piwik_Date.
  • period defines the period type: day / week / month / year. All Period related logic is located in the classes Piwik_Period*
  • ts_archived is the timestamp when the archive was built. It is useful in the case we want to know if an archive is still valid or not ; for example today's archive can be valid for 1 hour or 1 minute depending on the cache lifetime value
  • value contains the data which description is name

A record (row) in these archive tables is automatically handled by the classes Piwik_ArchiveProcessing_Record*. There are different classes in this file: the record manager Piwik_ArchiveProcessing_Record_Manager, the numeric record Piwik_ArchiveProcessing_Record_Numeric, the blob record Piwik_ArchiveProcessing_Record_Blob and the array of blob records Piwik_ArchiveProcessing_Record_Blob_Array.

The Archiving logic can be found in the class Piwik_ArchiveProcessing, day archiving is Piwik_ArchiveProcessing_Day and period archiving is Piwik_ArchiveProcessing_Period. It is important to note that most of the real archiving processing is actually done within plugins that hook on special events (see such hooks in ArchiveProcessing_Day search for the call Piwik_PostEvent()).

Loading an Archive (and possibly launch the archive processing if necessary) is done via Piwik_Archive.

For performance reasons, the tables are partitionned by month. It means that new tables will be created for every month, so the data is evenly partitioned. If we had only one table containing all the data, it would become huge and lookups would be very slow. Partitioning is done by the class Piwik_TablePartitioning. See also the monthly partitioning class used for this table: Piwik_TablePartitioning_Monthly.

We use a different table structure for FLOAT and BLOB because it makes it very fast to lookup the integer/float values. The SQL SELECT are very fast because the tables are light (and the rows of archive_numeric_* have a fixed length). For example we need to select the number of visitors for the last 30 days very quickly.

Debug / Info log

The tables logger_error, logger_message, logger_api_call and logger_exception are used to log misc information.

  • logger_error is used to log error message along with their information (line, php file errfile, backtrace, etc.) raised using Piwik_Log_Error (see also the specific error handler function used)
  • logger_message is used to log any debug / info message raised using Piwik_Log_Message
  • logger_api_call is used to log all the information related to api calls. It will log all the parameter's values passed (parameter_values), the result of the call (returned_value), the execution_time, the IP of the user calling the API (caller_ip), etc. This information is then useful for profiling the API calls, debugging (when the returned_value was not right for example), monitoring the API usage. See the class Piwik_Log_APICall
  • logger_exception is used to log all exceptions that are raised using Piwik_Log_Exception (see also the specific exception handler function used).

The general Log logic is done in Piwik_Log (it's using the library Zend_Log).

SQL query profiling

The table log_profiling is used to store profiling information about the SQL queries. You can enable profiling using Piwik_LogStats_Db::enableProfiling() and output the profiling analysis using Piwik::printLogStatsSQLProfiling().

Other

Notes

  • All the timestamp in the DB are generated by PHP and not using the Mysql functions such as NOW(), CURRENT_DATE(), etc. This is to make sure that the system works well if the Mysql server is on a different box with a different time. The algorithms are not dependant of the mysql server time.

Future evolutions, Optimizations

There is room for a lot of improvement on the performance side, as the software has not yet been tested with huge datasets (see #37).

For example, we could

  • partition the log_* tables by date (the same way archive_* tables are partitioned
  • review all the SQL queries under heavy load and with a lot of data to archive
  • review all the index strategies
  • to avoid huge joins between (log_visit, log_link_visit_action) we could denormalize the tables by duplicating the (idsite, visit_server_date) fields
  • ...