h1. Post-Mortem 2015-07-26: TravisCI Downtime
Author: Steffen Gebert
Original URL: https://notes.typo3.org/p/post-mortem-travisci-2015-07
tl;dr: we enabled replication to GitHub and overloaded TravisCI
TravisCI status report: https://www.traviscistatus.com/incidents/4gbn8hv2sp4m
h2. Setup
We use Gerrit as code review system: https://review.typo3.org/
We replicate every (proposed) change to https://github.com/typo3-ci/TYPO3.CMS-pre-merge-tests in order to trigger a TravisCI build (see the changes/x/y/z branches)
This provides feedback regarding whether the tests succeed, even before the commit is merged into the official code base (basically giving us the comfort that others have with GitHub pull requests)
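Such a setup is typically implemented with the Gerrit replication plugin. A minimal sketch of the relevant @replication.config@ section follows; the remote name and URL details are assumptions based on the description above:
<pre>
[remote "github-pre-merge"]
  url = git@github.com:typo3-ci/TYPO3.CMS-pre-merge-tests.git
  # Publish every proposed change under a branch name that TravisCI will build
  push = +refs/changes/*:refs/heads/changes/*
</pre>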
h2. Changes on 26.07.2015
* As we replicated only refs/changes/* from Gerrit to refs/heads/changes/* on GH, the status there did not reflect the actual difference between the current state of the target branch and the proposed change (as all release branches and master were, simply put, outdated since ~Jun 20th)
* Gerrit was now also told to replicate refs/heads/* and refs/tags/* to GH (around 15:30 GMT+2; see the config sketch after this list)
* This caused Gerrit to push numerous commits to GH (around 650 for master branch + xyz for other branches)
* TODO: Insert the missing piece here
* Speculation by AndyG: Gerrit replicates commit by commit, instead of updating the remote branch to the tip of the local branch immediately. To be confirmed
* Speculation by AndreasW: Each version of every patch added to Gerrit since Jun 20th was pushed (that should be more than 650)
** there are ~580 commits on master from Jun 20th to Jul 27th
* I (Steffen) think this is not the case. We had refs/changes/* replicated all the time. Only refs/{tags,heads}/ was missing, and due to the broken Chef setup, we were not able to deploy that update
* This somehow filled up the build queue on TravisCI, which caused starvation of other projects' builds (and apparently also caused some database issues?)
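For illustration, the change described above roughly corresponds to adding the following push refspecs to the same (assumed) remote section of @replication.config@; this is a sketch, not the actual configuration:
<pre>
  # Added on 26.07.2015: also mirror the real branches and tags
  push = +refs/heads/*:refs/heads/*
  push = +refs/tags/*:refs/tags/*
</pre>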
h2. Remedy
* Travis refused builds from the typo3-ci/TYPO3.CMS-pre-merge-tests repo
* Travis staff contacted us and joined the #typo3-cms channel in Slack
h2. Open Questions
* Dan Buch from TravisCI wrote about build requests that are "in the thousands" for that repo. We replicated maybe ~1000 commits to master + release branches.
* However, some people also received status mails for commits that are ~4 years old. It is not yet clear why those were built
* Does Gerrit really replicate to remote branches commit by commit?
h2. Conclusions / Next Steps
* We should be careful when replicating a large number of commits to a travis-enabled repo
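One simple, hypothetical precaution before enabling additional refspecs: estimate how far the mirror is behind the authoritative repository, e.g. with a commit count (the remote names @origin@ and @github@ are assumptions):
<pre>
git fetch origin
git fetch github
# Number of commits on origin/master that the GitHub mirror does not have yet
git rev-list --count github/master..origin/master
</pre>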
h1. Post-Mortem: Loss of Connectivity after Network Configuration Changes
On two occasions during the last days, service availability was impaired after changing @/etc/network/interfaces@ to add our VPN interfaces.
* Saturday, Nov 12th: backup server
* Friday, Nov 16th: physical host server ms06
Both times, @service networking restart@ resulted in a permanent loss of connectivity. As we were locked out of the running servers, externally triggered reboots were required.
Side note: We do not manually re-configure our servers, but use Chef instead. However, IP address configuration is not part of the Chef setup.
h2. Root Cause
While a syntax error was at least partially the reason in one case, we nevertheless experienced the same connectivity issue when running @service networking restart@ with a correct configuration.
The syntactically correct config file was only accepted after the reboot.
h2. Resolution and Recovery
In the first occurrence, we contacted the organization hosting our backup server, asking for a reboot. In the second case, we had means to execute a remote reset.
h2. Corrective and Preventative Measures
All changes to the network configuration should be backed by an automatic revert procedure that kicks in unless it is disabled by the operator, who remains connected.
According to "this issue":https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/1301015, @service networking restart@ should not be used. Instead, use
<pre>( ifdown iface; ifup iface ) &</pre>
However, we are not certain whether this would really be sufficient in all cases.
The following procedure should be automatically triggered to prevent further failures, independent of the way to reset networking:
* After 1 minute: Revert the configuration file change and restart networking
* After 5 minutes: Reboot the server
The following "gist":https://gist.github.com/StephenKing/83fedc56137f5640de929b4430f1b653 can be used, assuming that a backup has been created in @/etc/network/interfaces.bak@.
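A minimal sketch of such a safety net, assuming the backup in @/etc/network/interfaces.bak@ mentioned above and following the procedure from the list (the linked gist may differ in detail):
<pre>
#!/bin/sh
# Start this in the background *before* applying the new network configuration.
# Kill the background job once connectivity has been confirmed.
(
  sleep 60
  # Not cancelled within 1 minute: revert the change and restart networking
  cp /etc/network/interfaces.bak /etc/network/interfaces
  ifdown -a; ifup -a
  sleep 240
  # Still not cancelled after 5 minutes in total: reboot as a last resort
  reboot
) &
echo "Revert timer running as PID $! - kill it after confirming connectivity"
</pre>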
h1. Post-Mortem 2016-12-01: SOLR Search on typo3.org
Authors: Steffen Gebert, Michael Stucki
h2. Issue Summary
The SOLR search (including the TER listing) on typo3.org was unavailable, caused by a faulty ACL being deployed.
h2. Timeline
- 13:48 Jochen Roth "mentioned in slack":https://typo3.slack.com/archives/typo3-cms/p1480596515011722 that the TER is unavailable
- 14:37 Michael Stucki noticed this comment and "commented in the #typo3-server-team channel":https://typo3.slack.com/archives/typo3-server-team/p1480599426000787
- 14:51 Steffen Gebert started to investigate this issue
- 14:59 "Corrective fix":https://chef-ci.typo3.org/job/chef-repo/86/ being deployed
- 15:00 Search is available again
h2. Root Cause
- caused by a recent cookbook upload, a new and breaking version of the ohai Chef cookbook (> 4.0) was deployed; this version changes how the path to plugins is configured
- as a result, our plugin to fix the IPv6 address detection in OpenVZ was not applied anymore
- as a result, the solr cookbook became unaware of the @ip6address@ of the typo3.org server (by means of Chef search)
- as a result, the solr cookbook deployed the ACLs for Tomcat excluding the IPv6 address of the typo3.org server
- the typo3.org server was unable to contact the SOLR server anymore
h2. Resolution and Recovery
A "change to the chef environment":https://chef-ci.typo3.org/job/chef-repo/86/ now enforces the use of the up-to-date version of the Chef cookbook "t3-openvz":https://github.com/TYPO3-cookbooks/t3-openvz
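Such enforcement is typically expressed as version constraints in the Chef environment file. A minimal sketch, with file name and the exact constraints being assumptions (the real change is in the linked Jenkins job):
<pre>
# environments/production.rb (illustrative name and constraints)
name 'production'
description 'typo3.org production environment'
# Pin the platform cookbooks so that an incompatible ohai release
# cannot be picked up implicitly
cookbook 'ohai', '< 4.0.0'
cookbook 't3-openvz', '>= 0.3.0'
</pre>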
h2. Corrective and Preventative Measures
- monitoring checks were updated to catch this error. While we were already monitoring both the search function and the TER listing, the expected search strings apparently still appear in the output in case of errors, resulting in the defect not being detected
- we are about to upgrade our infrastructure to the newer ohai cookbook while still maintaining compatibility with our plugins.
- we will look into Chef's @Policyfile@ feature to see whether it helps us both become more confident about which cookbook versions are in use and update platform cookbooks without touching every application stack (a sketch follows below).
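A minimal @Policyfile.rb@ sketch of what that could look like; the run list, names and version constraints are assumptions:
<pre>
# Policyfile.rb (illustrative)
name 'typo3org'
default_source :supermarket
run_list 't3-openvz::default', 'solr::default'
# Explicit, reviewable pins for platform cookbooks
cookbook 'ohai', '~> 5.1'
cookbook 't3-openvz', git: 'https://github.com/TYPO3-cookbooks/t3-openvz'
</pre>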
h1. Post-Mortem 2016-12-22: Gerrit SSH Hang
Author: Steffen Gebert
h2. Issue Summary
Connections to Gerrit SSH (port 29418) were hanging (roughly every third attempt).
h2. Timeline
- Wed, 17:26: Stephan Großberndt reports the problem to the "#typo3-server-team channel":https://typo3.slack.com/archives/typo3-server-team/p1482337598000394
- Thu, 6:59 Starting to investigate this issue
- Thu, 7:45 Gerrit restart resolved the problem
h2. Root Cause
Gerrit did not respond to the client's SSH connection initialization:
!wireshark.png!
The exact reason remains unknown. There are a couple of threads on the Internet about hanging Gerrit SSH connections. Most of them report @database.poolLimit@ as a possible limiting factor. We have set it to @36@, which "should be sufficient".
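For reference, this setting lives in Gerrit's @gerrit.config@; a sketch of the relevant section, with the site path inferred from the log location mentioned in the measures below:
<pre>
# /var/gerrit/review/etc/gerrit.config (excerpt, path assumed)
[database]
  poolLimit = 36
</pre>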
h2. Resolution and Recovery
@systemctl restart gerrit@ resolved the problem.
h2. Corrective and Preventative Measures
The next time this issue occurs, the following should be checked:
- open tasks: @ssh review.typo3.org -p 29418 gerrit show-queue -w@
- open connections: @ssh review.typo3.org -p 29418 gerrit show-connections@
- maybe change log level using @ssh review.typo3.org -p 29418 gerrit logging@. SSH log file is in @/var/gerrit/review/logs/sshd_log@.
- thread dump: @jstack <pid>@ as @gerrit@ user (you can get the @pid@ from @systemctl status gerrit@)
h1. Post-Mortem 2017-01-18: Multiple Services affected by private network flapping
Authors: Steffen Gebert, Andri Steiner
h2. Issue Summary
All services running within the new KVM infrastructure were temporarily affected by issues caused by flapping of internal network connectivity.
h2. Timeline
- 07:13: First monitoring mails indicating issues
- 07:25: Georg Ringer reports in the "#typo3-server-team":https://typo3.slack.com/archives/typo3-server-team/p1484720740000734 channel about connection issues to Gerrit SSH
- 08:07: Issues confirmed by Michael Stucki
- 08:10: Our team is gathering together in the hotel (as we are on a sprint right now)
- 08:30: Good indications for Layer 2 bridging / ARP irregularities on both of the new KVM servers (ms08 and ms09)
- 08:47: "Tweet":https://twitter.com/TYPO3server/status/821624984447680512 informing about service outage
- 09:29: Service availability restored ("tweet":https://twitter.com/TYPO3server/status/821635606040248321)
h2. Root Cause
- Not exactly known
- we have seen incoming packets on the physical server's @br-int@ interface that were not forwarded to the @vnetX@ interface of the VM
- we have seen incomplete ARP tables on the VMs (@arp -a@)
- we have seen tons (hundreds per second) of ARP packets, mostly for resolving the gateway address (@tcpdump -i br-int arp@ on the host servers), all of which were correctly answered, many of them duplicated
- @dmesg@ emitted many of the following messages:
<pre>
br-int: received packet on t3o with own address as source address
net_ratelimit: 216 callbacks suppressed
</pre>
- there was no duplicated MAC address involved
- tinc log files (@/var/log/tinc.t3o.log@) did not include anything helpful
- after shutting down the private VPN (@systemctl stop tinc@) on the physical host, these ARP storms stopped
- starting @tinc@ again restored connectivity to other hosts, without any further ARP storms
The definitive cause is unknown.
- One theory is that tinc causes some looping or dropped packets
- Another is that the real issue is the local bridging and tinc was just a side effect
Once the issue occurs again, please **try the following first**:
- remove tinc from the bridge
<pre>
brctl delif br-int t3o
</pre>
to figure out whether it is a bridging or a tinc issue. After checking the effect, use the commands above to restart tinc and restore a clean state (a consolidated sketch follows after this list).
- create a tcpdump for all interfaces:
<pre>
tcpdump -i any -w dump.cap
</pre>
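A consolidated, hypothetical sequence of the debugging steps above; interface and bridge names are taken from this report, and whether restarting tinc re-adds @t3o@ to the bridge depends on the tinc-up script:
<pre>
# Take the tinc interface out of the bridge to see whether the ARP storm stops
brctl delif br-int t3o
# Watch the ARP traffic on the bridge while tinc is detached
tcpdump -i br-int arp
# Capture everything for later analysis
tcpdump -i any -w dump.cap
# Restore the previous state: restart tinc (as during the incident)
systemctl stop tinc
systemctl start tinc
brctl show br-int
</pre>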
h2. Corrective and Preventative Measures
- we plan to get rid of the tinc VPN in the long term, once we have migrated services to the new infrastructure and ordered rack space for all servers, including a dedicated internal network
- if this happens again, we could "issue an @INT@ signal to the tinc process":https://www.tinc-vpn.org/documentation/tincd.8 to temporarily switch to debug logging (or @USR1@ / @USR2@ signals to get connectivity status); see the sketch after this list
- get a better understanding of how to debug such connectivity / bridging issues and how / when the kernel drops packets etc.
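Hypothetical commands for the next occurrence; the PID lookup via @pidof@ is an assumption, and tinc logs to @/var/log/tinc.t3o.log@ on our hosts:
<pre>
# Temporarily raise tinc's debug level (send INT again to revert)
kill -INT "$(pidof tincd)"
# Dump the connection list and node/edge information into the log
kill -USR1 "$(pidof tincd)"
kill -USR2 "$(pidof tincd)"
tail -f /var/log/tinc.t3o.log
</pre>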