Sunday, October 30, 2011

Debugging a slow site using Apache response time and TinyLogAnalyzer

We met recently a new customer, a company that need help with a slow application composed by:
  • a Plone CMS
  • a set of external processes (a daemon, an e-mail scan system, ...)
  • an complex integration with an external database engine
  • an integration with two kind of Visual Basic based client desktop application
With the grow of the company bussines they suffer in years a gradually decay of performance.
Some architectural changes were needed, but the company also asked for some quick fix, to get back a faster application while major changes are planned.

Tune up this application was not simple because the whole system is Windows based, so a lot of well-know software we commonly use are not applicable. The company contacted us because the core of the application seems the Plone CMS and the underlying Zope, but we met a customized version of a very old Plone installation. More or less we were facing something that was not Plone.

We decided to start to look at the code (everyone always hope to find some app-go-slow=True flag activated!), but meanwhile we enabled the response time log on the Apache front-end (and it works also on the Windows version!)

How long does it take to serve a request?
You can change the default Apache HTTP log configuration from something like this:
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

...to this:
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %T/%D" combined

Suddently you log format change to this:
    [31/Jan/2008:14:19:07 +0000] "GET / HTTP/1.1" 200 7918 "" "..." 0/95491

...adding the time (in seconds and microseconds) needed to handle the request.

Please note that in a complex installation this will measure the total time needed for the request (if you put a reverse-cache proxy and a load balancer behind your Apache, this will raise the total time) but their Windows installation is very simple: Apache and Zope behind.

How to analyze results?
After a month of collected log data in the new format we wanted to see some statistics. The problem now was: how to read this log to learn something?

After some Google search I was a little worried. There are some real-time monitor tool for log analysis, but we are really scared to try them on a production Windows environment.

The only log analyzer found was apache-response-time: the projects description is really promising, however I was not able to use it: I installed it on a Linux and MacOS system but both I get a segmentation fault error.
Neither a stackoverflow question helped me to find something more. 

So I started to write some few Python lines preparing a custom analyzer. This job became interesting, so I spent a pair of week-ends for code refactoring and to make initial foo.py script is something reusable.

The result was TinyLogAnalyzer, a command line HTTP log analyzer that take a log file with response time enabled and extract two kind of statistics: the top total time and the top average time.

The top total time give you an idea of what is the most time consuming part of the application. In out case we found that more than 20% of the total time was spent performing searches and a special call performed by an external daemon.
Some optimization and changes to this code make the customer a little happier.
Also we find that a lot of requests were done for download external JavaScript resources, never cached.

The top average time instead give you an idea of the slower part of the application. With this I found that object creation and traversing in a special part of the application was really slow due to use of basic (not BTree) Plone folders.

All other release of the log analyzer product you can find are bug-fixing and new features still done for the same customer's problem (meanwhile we collected two more months of log).
For example, I found that some of the external tasks run also during the weekend, so collect data during Saturday and Sunday was wrong. The same for the night time.

Conclusion
The script is all but perfect. For example it keep all data in memory (so it's not the good tool for handle gigabytes-long-file) and read the log from the first line to the last (so if you need to skip some months it is still really slow).

However: it helped me a lot, giving some time to us and some breath to the customer, making things going better while major refactoring are planned.

Apart of the script itself, I think that enabling the Apache response time can be a great source of information when things goes wrong.

Sunday, October 16, 2011

Plone, security and workflows: learn how to use "Acquire" option

This article is a new episode of my security and workflow common mistake using Plone.

In the first part, "Plone, security and workflows: when rely on Owner role is bad" I introduced a potential workflow problem, that however can be a real problem only in some kind of organizations.

Today I don't want to focus onto roles, but on the use (or missing use) of the "Acquire" check when designing workflows.
Don't know how to use this checkbox can lead to security problem, especially in conjunction with the View permission.

Is also important to be noted that this problem can be inside current workflows distributed using Plone. This depends on how you and your users think that Plone security must work to fill your needs.

What is the "Aquire" meaning?
You can see the "Acquire" check ("Acquire permission settings?") while defining the workflow permissions page, for every permission managed by the workflow.

A workflow state's permission setting
The simple meaning is "roles that behave this permission are the ones listed there, plus ones defined on the container of this content".

If the parent itself use a workflow and manage same permissions, it can also use Acquire check again, so acquiring other permissions for roles from it's own parent... and so on.
This continue until the Plone site (or sometimes, up again to Zope root object) where all permissions are defined in the "Security" tab.

Use the Acquire settings without flagging additional roles for a permission is ok: the meaning is "this workflow state don't want to manage this permission". Instead, another state of the same workflow will probably add or change something to the permission.
If every state of a workflow simply use the Acquire check without change something, the customization of this permission in the workflow is useless and can probably be removed.

Also: don't manage a permission for a workflow is still ok. You will never find a workflow that handle all permissions. Only manage the ones you care about (so, for workflow for custom content types, please remove the "Change portal events" permission!).

Out of the box security
More than one Plone standard workflows suffer of what we can define "a not perfect configuration". This do not mean that this is a Plone architectural problem: with Plone you can create great, perfectly working and safe, workflows! Just: the provided ones could be not perfect in some situations and for being able to fix them or create new ones, you need to understand why!
Also, if the workflow is ok or not, depends on how you and your customer think that Plone must work.

The problem is commonly associated with the View permission, however all analysis that follow are the same for every other permissions. We will refer to "Simple Publication Workflow", while will be the same for other like "Intranet/Extranet Workflow".
The "Simple Publication Workflow" is used from all Plone default contents, also by folders.

Let's take a look at the published state permissions setting.

How the default public state looks like
Look in detail at the Anonymous column. The View permission is given at the Anonymous role. And this seems good: you are explicitly saying "every content in this state will be visible by the anonymous". Always. Everywhere.

How can be this bad? The content is public, I want that Anonymous see it!

Let think this applied to folders: users (99% of times, in my experience) see the private and the published state like a room with a door:
  • A published folder is like a room with an open door. You can see inside the room itself, and other rooms that follow (if those rooms are open)
  • A private folder is like a closed door. You can't see inside it, and you don't care if other doors that follow are open or closed. You don't even know if there are other doors!
Users that move from published to private a main folder that contains a deep tree of subfolder and contents, commonly think that this action will hide all the tree!

You probably already get the point. An explicit check of the View permission for a role is like a magic crystal ball. You don't care of the door: you have the power too see inside.

Going back to Plone world: think about having a Plone folder (folder-1) with a document inside (document-1). If the document is published you can see it using your crystal ball. The review state (even if private) of the containing folder doesn't matter.
This is what is not always clear to users.
  • Users can reach http://yourcompany.com/
  • Users can't reach http://yourcompany.com/folder-1 (he will get Unauthorized)
  • Users can reach http://yourcompany.com/folder-1/document-1
During developer training that customers ask us, one of the argument is always how to customize workflows. Sometimes those customers have already installed a Plone site or intranet by their own so when I explain this, is not uncommon to see surprised faces.

O__o

Another bad attitude: customers never tested the workflow deeply and they think that their contents are is a secure review state just because a main folder is private.

How to fix this
As said above: this is not a Plone technological error. There's a way to get exactly what we want: a set of rooms with doors, with no crystal ball, secret passage or teleport bridge that can transport unwanted users inside the room.

The trick is: don't say that the document is visible explicitly, leave this responsibility to the container (to the room's door).

This is traduced in a workflow configuration that remove explicit permission for Anonymous but use the Acquire power.

A workflow state's permission setting
Note also an important thing: we don't give the check to Anonymous, but we now check permissions for all other roles that matter. We mean there: all roles that must see the content ignoring if the access door is open or closed.

In this way an anonymous user is able to see the content if he can enter the room (published container folder) and will be no more able to see anything when the door is closed (private container folder).

This is really simple!

When not talking of anonymous access
Somewhere above I said that also the "Intranet/Extranet Workflow" suffer the same configuration problem. Can we fix also this?

This kind of workflow is designed for intranets, but also provide some review state for public contents: it add the "internally published" concept to our Plone site.

Note that this workflow is designed to be used with "Intranet Workflow for Folders" associated with folderish content types.

When using those workflows, we are sometimes referring to the Member roles as we did for Anonymous above.

The default configuration of the "internally_published" state is as follow:

A workflow state's permission setting
The concept behind this state is "having a published state for intranet users" (so excluding Anonymous).
So we have the same problem for the explicit check of the View permission, this time for Member role.

We can fix this in the same way:

A workflow state's permission setting
We removed the explicit View permission for Member role while enabling the "Acquire".

This time the cure has some limit.

First problem: if we put an internally_published content in the site root, in our "fixed" version of the workflow we are exposing the content to anonymous access, while the original version protects it (know that you will never remove the View permission for Anonymous role from the site root or Plone will work in unpredictable way).

Another problem: take a look at states defined for "Intranet Workflow for Folders". It's composed by two states: private and internal. If we look at the internal description we read "Visible to all intranet users", so it want to be a "published folder for intranet users".

The private state has no View permission for Member role, while the internal state explicitly check it for site members.
As explained above, this mean that if we put an internal folder inside a private folder, site members will be able to skip the private state of the main container folder and access directly the inner ones and all contents inside.

If we fix this removing the explicit check for Member while activating the "Acquire" check we will have a similar problem to the first ones: anonymous users will be able to access the internal folder in the site root...

There is no perfect solution for this problem using only a good workflow configuration. We need to make some default configuration to the site structure, helping the workflow:
  • We fix all "internal published like" states of both workflows as described (using "Acquire")
  • We put all internal folder in one (or more) main private folder(s) in the site root
  • We give to all users (we can also use the "authenticated users" virtual group) the "can read" check (Reader role) from the sharing tab of this private folder
  • We remove from all contents/subfolders inside master private folder(s) the roles inheritance from the sharing tab, removing Reader power to all members
  • Inner subfolder and contents will live with normal worfklow security settings
With a configuration like this we can be sure that Anonymous users can't access our contents (as they all live in a private folder) while still be able to control the visibility of a subtree changing the main folder review state (relying on "Acquire").

Think about having a dedicated workflow for folders
Designing a workflow can often be design a set of workflows. This is common; the "Intranet Workflow for Folders" is a perfect example. To use it right, it's better to use also the workflow for folders.

I find often useful to create two different workflow for real contents and folders.
Here a common request I get: Reader role must be able to enter inside private area, but inside this being able to only published contents.

Obviously the workflow I will design will be a workflow that rely onto "Acquire".
While the private state for folders will give explicit View permission to Reader, the private state for contents won't give such power. Done!

Conclusion
Before load your new workflow in a production site or intranet, be sure of what workflow you give to contents... and test it.

A simple way for performing those tests is_ search contents using the site search feature.