Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the "Database usage" plugin, more statistics #3004

Closed
mattab opened this issue Mar 3, 2012 · 20 comments
Closed

Improve the "Database usage" plugin, more statistics #3004

mattab opened this issue Mar 3, 2012 · 20 comments
Labels
Bug For errors / faults / flaws / inconsistencies etc. Critical Indicates the severity of an issue is very critical and the issue has a very high priority.
Milestone

Comments

@mattab
Copy link
Member

mattab commented Mar 3, 2012

It would be nice to see some stats about Piwik DB usage, mostly for debug but also could be useful to the user for long term Mysql scaling and general info.

Here are the stats we could display:

  • Current Total size is already displayed but at the bottom of table
    • Total for tracking
    • Total for archiving
    • Total for core tables
  • total visits, pages views, unique actions, conversions, ecommerce products

  • of websites, users, reports, etc.

  • Sized used by the archive tables:
    • total blob VS total numeric
    • also the sum of archive_* tables by year
  • Estimate of future growth of archive tables for example "We estimate that DB size will approximately grow X Gb bigger (~YGb total) in the next 12 months (based on your DB size from last year and measured data growth rate".
  • If logs delete is not enabled, we could write: "If you wish to save space, you could consider deleting old logs from the database, this would ensure your logging database does not grow out of control".

This message should give a short overview of the Piwik DB, in terms of general usage data, and DB focused summary for capacity planning.

See also #3003, #53, #2805 to help around this issue.

@diosmosis
Copy link
Member

Attachment: Patch for this issue.
3004.diff.tar.gz

@diosmosis
Copy link
Member

I've uploaded a patch for this issue (a redesign of the DBStats plugin). It's a pretty big patch, so I think it should be reviewed. Let me know what you think.

@mattab
Copy link
Member Author

mattab commented May 7, 2012

Wow, very nice UI and useful reports!

Sorry for the delay in replying to the ticket. Here is my code review and feedback:

  • Great work overall and nice new feature! This page will make debugging any big piwik install a much nicer and enjoyable experience. Well done! I can see it becoming the start of a dashboard we can keep extending with key metrics.
  • How long does it take to generate the exact size per report on a large install with 100k rows in a blob table for example?
  • + 'CoreHome_RelatedReports' => 'Related reports',
    could be General_RelatedReports as it will be used def in Piwik mobile etc.
  • all the strings:

+   'DBStats_ReportData' => 'Report Data',
+   'DBStats_ReportTables' => 'Report Tables',
+   'DBStats_ReportDataByYear' => 'Report Tables By Year',
+   'DBStats_MetricTables' => 'Metric Tables',
+   'DBStats_MetricData' => 'Metric Data',
+   'DBStats_MetricDataByYear' => 'Metric Tables By Year',

I think can be simplified eg. "Report data size" or "Report tables, data size"
Then "'Metric Tables By Year'," becomes "%s By %s" and we already know "year" and then you can plugin the one title. So all these translations become 3 new strings.
The new string could be "General_ReportByDate" => "%1$s by %2$s"

  • 'DBStats_GeneralInformation' => 'General Information',
    -> in General_
  • excellent new feature, footer related reports. We could reuse this feature for many other reports! Actually do you mind creating a ticket and close it with the commit, to clearly mark this in the changelog?
  • plugins/DBStats/API.php becomes too big. Also there are many public functiosn that shouldnt be public from the REST API point of view. Refactor all SQL related code in a new class in the DBStats plugin to keep the API cleaner and purely applying filters, no sql?
  • RowCallbackAddColumn -> rename to AddColumn
  • The new filter ColumnCallback -> is the same as existing ColumnCallbackReplace ?
  • The new filter Piwik_DataTable_Filter_RowCallback -> is not useful because filters are already all "rows callback" since they just loop on rows and run the function, this filter does not refactor logic
  • Controller new code: OK!
  • Controller graph code: ideally the graph should be fixed in core to automatically work well with any Unit on any axis. There is in API/API.php a function to get the unit for a column name. Somehow the graphs may use this to handle all units and plot data without requiring the big "hacks" like you had to do here. Im not sure how hard it would be to fix core?
  • Code in plugins/DBStats/templates/dbStats.js could be simplified by using the broadcast object which helps access parameters maybe? Or maybe it could be refactored in the broadcast if not there yet.

+   public function setRelatedReports( $relatedReports )
+   {
+       $apiMethod = "$currentControllerName.$currentControllerAction";
+       
+       $this->viewProperties['relatedReports'] = array();

Here $apiMethod will throw E_NOTICE ?

  • To build the URL in setRelatedReports() you should use instead the helper: Piwik_Url->getCurrentQueryStringWithParametersModified
  • getPrettyNumber() is a really good one. We should make sure it's used all over the place by calling the functio at the right poijnt in code (not sure where) so that all numbers are consistent in all datatables & in All websites reports and Email reports :)

@diosmosis
Copy link
Member

Some comments:

Replying to matt:

Wow, very nice UI and useful reports!

  • RowCallbackAddColumn -> rename to AddColumn
  • The new filter ColumnCallback -> is the same as existing ColumnCallbackReplace ?
  • The new filter Piwik_DataTable_Filter_RowCallback -> is not useful because filters are already all "rows callback" since they just loop on rows and run the function, this filter does not refactor logic

RowCallback & ColumnCallback are just intermediate classes. They don't do anything except execute a callback on some data and are not meant to be used directly. GroupBy & RowCallbackAddColumn derive from them so the code is clearer. I could get rid of them, but then there'll be some code redundancy. I could also keep them and make them 'abstract'.

As for filters, they seem to operate not on rows, per say, but entire DataTables. So I guess, technically, they can do whatever they want to a DataTable. Perhaps some of the logic for my RowCallback should be in the Filter base type?

I was trying to make my filter code clear, w/o refactoring existing code until my changes went under review. Do you think it would be better not to use base types in creating new filters?

  • Controller graph code: ideally the graph should be fixed in core to automatically work well with any Unit on any axis. There is in API/API.php a function to get the unit for a column name. Somehow the graphs may use this to handle all units and plot data without requiring the big "hacks" like you had to do here. Im not sure how hard it would be to fix core?

The problem wasn't that it wouldn't work w/ units, but that for memory values to display nicely, they'd need to use several different units (ie, B, KB, MB, ...). This would require modifying the jqplot related code, or possibly jqplot itself. Which would be hard, I'm guessing :)

@mattab
Copy link
Member Author

mattab commented May 7, 2012

RowCallback & ColumnCallback are just intermediate classes. They don't do anything except execute a callback on some data and are not meant to be used directly. GroupBy & RowCallbackAddColumn derive from them so the code is clearer. I could get rid of them, but then there'll be some code redundancy. I could also keep them and make them 'abstract'.

I think the code can be inlined it's pretty simple:
foreach($table->getRows() as $row) {
$function($row); // or call_user_func_array
}

It's a case that refactoring is not quite worth the trouble maybe, but also could be if you make them abstract.... up to you?

@diosmosis
Copy link
Member

I tested the blob/metric queries:

BLOB TABLE RESULTS

For a table w/ 100,000 rows: 6.8s
For a table w/ 200,000 rows: 13.56s

NUMERIC TABLE RESULTS

For a table w/ 100,000 rows: 5.66s
For a table w/ 200,000 rows: 11.2s

Hopefully, this isn't too slow. I think we could maybe cache the results of each query (except the current month) in the piwik_option table. Since it would be one per archive table, I don't think it would be all that much space. Would be a bit of a hack, though.

@diosmosis
Copy link
Member

Attachment: Patch for this issue. (2nd)
3004.diff.tar.2.gz

@diosmosis
Copy link
Member

Uploaded a new patch. Some notes:

  • I didn't add the '%s By %s' translation. I'm no linguist, but I think some language will object to this sort of string. Or maybe not. In either case, since there are similar translations that don't use '%s By %s', I left things unchanged for now.
  • I removed my intermediate callbacks and added a new one: ColumnCallbackAddColumn. This will execute a callback on one or more columns and add the result as a new column.
  • I didn't try and put calls to getPrettyNumber anywhere else. I think, though, that it might be possible to create a generic filter which will prettify numbers for relevent metrics (ie, nb_visits, but not revenue). Existing prettifying functions can be modified to use getPrettyNumber. Think that'll work?
  • I'll put my related reports code in a new ticket.

Let me know what you think of my patch.

@mattab
Copy link
Member Author

mattab commented May 24, 2012

Sorry for the delay, here is the review:

  • Thanks for your changes patch is nearly perfect now :)
  • Times are OK, less than 20s is fine, thanks for testing
  • you can add @ignore in the reduceArchiveRowName() doc block, so it doesnt show up in the API listing since it's not public for the API but because of implementation details
  • I don't think the file "plugins/DBStats/templates/dbStats.js" should be added via getJsFiles() hook, because it will be executed on document.ready for all pages. It would be better in this case to simply include via <script src=''> markup on the DBstats report.
  • The new call to $this->postDataTableLoadedFromAPI(); in core/ViewDataTable/GenerateGraphData.php might have some side effects, can you please check that vthe various graphs are still working ok?
  • + 'issummaryrow' => $id == Piwik_DataTable::ID_SUMMARY_ROW,
    For performance reasons, since the PHP renderer is used quite a bit, I suggest that you only set the index when it is defined, so that most rows will not have extra-data attached to them but only the summary row will.
  • getPrettyNumber() is a great idea but should be generalized across all of piwik, do you mind creating a new ticket for this so we don't forget?

feel free to commmit directly

@diosmosis
Copy link
Member

Replying to matt:

  • Times are OK, less than 20s is fine, thanks for testing

To be clear, that's ~6-13s per table. So if there is two years of data, it'd be 5.2 min, which I'm guessing isn't so great.

I think I could speed things up w/ a scheduled task that gets run once a month. It would create the table for all existing archive tables except the current month and store them in the options table. Sound good?

@mattab
Copy link
Member Author

mattab commented May 25, 2012

A scheduled task sounds good!

In the code, I recommend to cache the result in API.php in the option table. The scheduled task should not do the caching but simply call the API to "pre-archive" the data.

There are cases where new data could be added to the past tables (eg. all "january" tables contain the data of the yearly archives), so maybe the task could be weekly to make it slightly more accurate? It will only be executed for users who enabled the DBstats plugin so I'm not worried about overhead for most users since most users dont enable the plugin.

@diosmosis
Copy link
Member

Replying to matt:

A scheduled task sounds good!

In the code, I recommend to cache the result in API.php in the option table. The scheduled task should not do the caching but simply call the API to "pre-archive" the data.

There are cases where new data could be added to the past tables (eg. all "january" tables contain the data of the yearly archives), so maybe the task could be weekly to make it slightly more accurate? It will only be executed for users who enabled the DBstats plugin so I'm not worried about overhead for most users since most users dont enable the plugin.

I think I can get the last updated timestamp for a table and check if the table was modified since the last 'archive' (see http://stackoverflow.com/questions/307438/how-can-i-tell-when-a-mysql-table-was-last-updated ). This should make it accurate and fast :)

@diosmosis
Copy link
Member

Replying to capedfuzz:

Replying to matt:

A scheduled task sounds good!

In the code, I recommend to cache the result in API.php in the option table. The scheduled task should not do the caching but simply call the API to "pre-archive" the data.

There are cases where new data could be added to the past tables (eg. all "january" tables contain the data of the yearly archives), so maybe the task could be weekly to make it slightly more accurate? It will only be executed for users who enabled the DBstats plugin so I'm not worried about overhead for most users since most users dont enable the plugin.

I think I can get the last updated timestamp for a table and check if the table was modified since the last 'archive' (see http://stackoverflow.com/questions/307438/how-can-i-tell-when-a-mysql-table-was-last-updated ). This should make it accurate and fast :)

Actually, this won't work for InnoDB tables... So for now, the best options seems like your suggestion for the weekly task. Or maybe we could patch InnoDB? :)

@mattab
Copy link
Member Author

mattab commented May 25, 2012

could we maybe do SELECT MAX(ts_archived) ?
if you still have your big tables, would be interesting how fast it is?

Otherwise weekly full refresh is fine too.

If so could you write in the UI in grey "inline help" style eg "Report last updated on 2012, Jan 23th" to make sure users know about the 1 week delay.

@diosmosis
Copy link
Member

Replying to matt:

could we maybe do SELECT MAX(ts_archived) ?
if you still have your big tables, would be interesting how fast it is?

Otherwise weekly full refresh is fine too.

If so could you write in the UI in grey "inline help" style eg "Report last updated on 2012, Jan 23th" to make sure users know about the 1 week delay.

For 200,000 rows the query takes about ~.4s, however MAX(idarchive) is instantaneous, so I'll use that.

@diosmosis
Copy link
Member

(In [6324]) Fixes #3004, redesigned DBStats plugin, added several new reports including database space taken up by tracker tables, database space taken up by archive blob tables, database space taken up by archive metric tables, database space taken up by individual reports & database space taken up by individual metrics.

Notes:

  • Added ability to highlight the summary row in ViewDataTable and the ability to always show the summary row regardless of what report page is currently being shown.
  • Fixed small issue w/ ViewDataTable::hasReportBeenPurged: default values should be specified in calls to getRequestVar.
  • Added Piwik_FetchAssoc function to PluginsFunctions/Sql.php
  • Augmented ColumnCallbackAddColumnQuotient filter so the divisor can be obtained from the summary row.
  • Modified AddSummaryRow filter so it wouldn't delete rows if desired.
  • Added ColumnCallbackAddColumn filer that adds a new column to each row based on the result of a callback.
  • Modified ColumnCallbackReplace filter so callback can operate on more than one column value if desired.
  • Modified Limit filter so, if desired, the summary row can be exempted from deletion.
  • Added GroupBy filter that groups/sums rows by the result of a callback function.
  • Fixed GenerateGraphData.php bug where priority filters were not called on view data table.
  • Added getPrettyNumber utility function.

@diosmosis
Copy link
Member

(In [6329]) Refs #3004, fix PrivacyManager regression.

@mattab
Copy link
Member Author

mattab commented May 28, 2012

Wow big commit!! Nice work..

Reopening & Code Review:

  •       if ($id == Piwik_DataTable::ID_SUMMARY_ROW)
    
  •       {
    
  •           $newRow['issummaryrow'] = Piwik_DataTable::ID_SUMMARY_ROW;
    
  •       }
    

maybe should be = true or = 1 instead of -1 ?

  • in public function hasReportBeenPurged() the following if() should not accept == 0 correct?
    if ($idSite == 0 || intval($idSite) != 0)
  • resetTableStatuses / resetTableStatuses should checkUserIsSuperUser()

@diosmosis
Copy link
Member

(In [6340]) Refs #3004, fix regression due to keep_summary_row JS check.

@diosmosis
Copy link
Member

(In [6343]) Fixes #3004, tweaks.

This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug For errors / faults / flaws / inconsistencies etc. Critical Indicates the severity of an issue is very critical and the issue has a very high priority.
Projects
None yet
Development

No branches or pull requests

2 participants