CPU utilization problems with large userbase


#1

We have been trialing Wekan at a large organisation for nearly a year, starting with Libreboard and then moving on to Wekan. Since the move to Wekan we’ve had major CPU utilization issues with the Node process, which result in it failing.

Configuration is as follows (all servers running Red Hat 7.3):

  • 3 virtual servers (2x CPU, 4GB RAM), each hosting a single Wekan instance (currently v0.41) running on Node v4.8.4
  • 3 virtual servers (2x CPU, 4GB RAM), each hosting a MongoDB instance, configured in a ReplicaSet
  • 1 virtual server (2x CPU, 4GB RAM) running Nginx as a load balancer

Our usage figures are as follows:
Users: 2785 Boards: 3564 Lists: 10418 Cards: 21361

The problem we’re encountering is that as the number of concurrent users increases (anything above 50), CPU usage on the Node instances climbs to 90-100% until the process can no longer respond to requests from the web browser. At that point I either have to restart the Node process, or it fails and is restarted by systemd.
For information, the memory usage of the Node processes is low (<10%).
For information, the CPU usage on the primary MongoDB server is 20-30%.

I’ve tried to debug the cause of this but have so far been unable to find a root cause. My suspicion is that there’s a costly process attached to each of the connected clients that ramps up the CPU usage but I don’t know enough about Meteor/Node/MongoDB to debug it. Can anyone suggest any routes to investigate?
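One low-effort route, offered as a sketch rather than a confirmed diagnostic step, is to log each Node process's CPU share over time so it can be lined up against the number of connected clients. The Python below assumes a Linux host with `/proc` available; the function names and the 5-second interval are illustrative choices, not anything Wekan provides.

```python
import os
import time


def parse_cpu_ticks(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split on the last ')' before splitting the other fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, interval_s, ticks_per_s):
    """CPU time used over the interval, as a percentage of one core."""
    return (ticks_after - ticks_before) / ticks_per_s / interval_s * 100.0


def sample_pid(pid, interval_s=5.0):
    """Sample one process twice and return its CPU percentage (Linux only)."""
    ticks_per_s = os.sysconf("SC_CLK_TCK")
    path = "/proc/%d/stat" % pid
    with open(path) as f:
        before = parse_cpu_ticks(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_ticks(f.read())
    return cpu_percent(before, after, interval_s, ticks_per_s)
```

Calling `sample_pid(pid)` in a loop for each Node PID and writing the result alongside a timestamp gives a series that can be compared with the load balancer's connection counts. It only shows when the CPU climbs, not why; a profiler is still needed for the root cause.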


#2

@GavinLilly

I would guess the bottleneck is MongoDB. Using PostgreSQL with ToroDB would speed up at least read traffic. There is a ToroDB server with which replacing MongoDB with PostgreSQL is possible, and MySQL can also be used, so you could look at those options from ToroDB.


#3

I’ll be honest, I’m not convinced that MongoDB is the source of the problems. I’ve been looking at the usage stats this morning and I’m not seeing anything out of the ordinary. You can see a brief output from mongostat here. There are a lot of queries but barely anything in the queues.

I’m currently trying to set up a Kadira instance to help find the source of the problems. I will report back once I have more info.


#4

Yes, profiling would be much more useful than my guess :slight_smile: Thanks!


#5

I have the same issue.
Configuration is as follows (all servers running CentOS 7.5):

1 virtual server (8x CPU, 16GB RAM), hosting 5 Wekan instances (currently v0.50, running on Node v4.8.4), a MongoDB instance, and Nginx as a load balancer

If a Wekan process serves more than 20 clients, its CPU usage exceeds 100% and the client browser becomes slow. At the same time,
MongoDB is at only about 50% CPU usage.

Do you have any suggestions? Thanks.


#6

Wekan v0.60, released today, has fixes for this, please test:


#7

We’ve been using v0.60 since 6th December and so far performance actually seems worse. Processes need to be restarted more often and the app is slower to respond.

I’m going to look back through the intermediate versions between v0.50 and v0.60 to see if there was some other feature that may have had an impact.


#8

Hi

We have the same issue here at SG. We have 4000 boards and 3050 users (93000 cards).
We have 1 physical server with 64 CPUs and 512 GB of RAM. RAM is not the issue, but the CPU consumption of each node is: exactly the same symptom as @GavinLilly. We put an HAProxy load balancer in Docker, and we have 20 Docker instances of Wekan (and 1 Mongo).
The only “stable” version is v0.13; each time I move to another one, the instances go to 100%. I tested v0.71 today, same issue.
Thanks


#9

It’s interesting that only v0.13 is stable for you. If you compare 0.13 against the next release (0.16) you’ll see there are very few changes; certainly no major ones.

Are the 20 instances managing the load for your userbase? How are you managing the restart of those instances when they hit 100% usage and stall?


#10

Yes, but on Docker Hub the next available version is v0.19, where, if my memory is good, there are a lot of changes.
We have HAProxy, like you, on top of the 20 instances, and we have a script that kills Docker containers running at 100% CPU…
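The script itself is not shown in this thread. As a rough illustration of what such a watchdog might look like, here is a minimal Python sketch. It assumes the `docker` CLI is on the PATH, and that a container should only be restarted after several consecutive high samples so a short spike does not trigger it. The 95% threshold, the three-strike rule, and the use of `docker restart` rather than a kill are all assumptions made for the example.

```python
import subprocess


def parse_docker_stats(output):
    """Parse 'name<TAB>12.34%' lines into a {name: cpu_percent} dict."""
    usage = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, cpu = line.split("\t")
        usage[name.strip()] = float(cpu.strip().rstrip("%"))
    return usage


def update_strikes(strikes, usage, threshold=95.0):
    """Count consecutive over-threshold samples for each container."""
    updated = {}
    for name, cpu in usage.items():
        # A sample below the threshold resets the counter to zero.
        updated[name] = strikes.get(name, 0) + 1 if cpu >= threshold else 0
    return updated


def containers_to_restart(strikes, max_strikes=3):
    """Names of containers that stayed hot for max_strikes samples."""
    return sorted(n for n, count in strikes.items() if count >= max_strikes)


def read_docker_stats():
    """Take one CPU sample per running container via the docker CLI."""
    result = subprocess.run(
        ["docker", "stats", "--no-stream",
         "--format", "{{.Name}}\t{{.CPUPerc}}"],
        check=True, capture_output=True, text=True,
    )
    return parse_docker_stats(result.stdout)


def restart(name):
    """Restart one container by name (assumed remedy, see lead-in)."""
    subprocess.run(["docker", "restart", name], check=True)
```

A driver loop would call `read_docker_stats()` every few seconds, feed the result through `update_strikes`, and call `restart` on whatever `containers_to_restart` returns. Note that `docker stats` reports CPU relative to one core, so a multi-threaded container can show more than 100%.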


#11

Ahhh ok, well, comparing v0.13 against v0.19 there are a lot of changes, but they’re primarily i18n and formatting changes. The only significant changes I can see are the introduction of allowing users to only comment on boards(?), the REST API and the Winston logger (though I think this is disabled by default).

I don’t see any detrimental changes introduced here, but then my NodeJS/Meteor knowledge is not great.


#12

@fmonthel

Are you using pm2? I had a report from a Wekan user with thousands of users that, after removing pm2 and upgrading to Wekan v0.71, performance was much better and it is possible to use Wekan normally.


#13

Hi @xet7
We’re using the official Docker Hub image https://hub.docker.com/r/wekanteam/wekan/
Not sure if Node.js comes with PM2 in this Docker image.
Thanks


#14

No, Wekan does not include PM2.

It is a process manager; some people have installed Wekan with it.


#15

Hi,
we have the exact same issue here with:
Users: 5k Boards: 5k Lists: 25k Cards: 65k

We started years ago from LibreBoard and are now stuck with the 0.11.0-rc2 Docker image from mquandalle.
Every migration to a higher version was followed by a rollback, as the nodes crash shortly afterwards with 100% CPU.

We have an HAProxy that load balances to 6 nodes running on Docker.
MongoDB is in replica set mode (v3.2.13); performance is good on this side.

The last version we tried is v0.63. It was still failing even with smart-disconnect (which successfully reduced the number of connected users per node from ~35 to ~6). Does that mean it’s not a performance issue?

So far, we have not managed to reproduce the issue on an iso-production platform.

Best,
Clement


#16

@clement

A similar number of users is running very well on the latest Wekan version with these AWS server specs:
3-4x m4.large for Node (ECS Cluster)
3x r4.large for Mongo (1 Primary for read and write, 2 replicas)

This setup works well and does not yet have redis oplog added, which would improve scalability even more:

There is something wrong in your setup. Are you running in AWS or a private cloud?


#17

@clement

For enterprises using Wekan, I really recommend participating in Wekan development: submitting features, performance bugfixes etc. upstream, having your own developers work on Wekan daily, and using Commercial Support at https://wekan.team . With the benefits you get from using Wekan, it’s time well spent.

https://blog.wekan.team/2018/02/benefits-of-contributing-your-features-to-upstream-wekan/index.html

https://blog.wekan.team/2018/02/time-well-spent/index.html