Verified Commit 3cdb538c authored by Andri Steiner

Convert RST to Markdown #1

parent 90ac327f
h1. Post-Mortem 2015-01-12-Gerrit-Database
Author: Steffen Gebert
Original URL: https://forum.typo3.org/index.php/t/208068/
Today from ca. 18:20-21:00h CET, Gerrit showed error 500s when
accessing changes because the MySQL service crashed (repeatedly).
Prior to that, the server received a hard reset two times today due to
data center power outages.
The reason for the problem was that the "changes" MySQL table was
corrupted and InnoDB halts the server once it notices checksum mismatches.
How to fix (a consolidated sketch follows the list):
http://www.percona.com/blog/2008/07/04/recovering-innodb-table-corruption/
* start the server with @innodb_force_recovery=1@ so that it is
read-only but will *not* halt on read errors
* dump table
* restart without recovery flag
* drop original table
* import dump
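Putting those steps together, a minimal sketch of the recovery sequence, assuming the corrupted table is @changes@ in Gerrit's @reviewdb@ database (the database name is an assumption) and a Debian-style MySQL config layout:
<pre>
# 1. enable forced recovery (temporary file; remove after the rescue)
cat > /etc/mysql/conf.d/recovery.cnf <<'EOF'
[mysqld]
innodb_force_recovery = 1
EOF
service mysql restart

# 2. dump the corrupted table while the server is read-only
mysqldump reviewdb changes > changes.sql

# 3. restart without the recovery flag
rm /etc/mysql/conf.d/recovery.cnf
service mysql restart

# 4. drop the original table and re-import the dump
mysql reviewdb -e 'DROP TABLE changes;'
mysql reviewdb < changes.sql
</pre>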
What also helped me to verify which table was corrupted:
<pre>
# for i in $(ls /var/lib/mysql/*/*.ibd); do innochecksum -v $i; done
</pre>
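(Note that @innochecksum@ works on offline files; it will refuse to check an @.ibd@ file that the running MySQL server still has open, so stop the server first.)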
via "dba.stackexchange.com":http://dba.stackexchange.com/questions/6191/how-do-you-identify-innodb-table-corruption
Post-Mortem 2015-01-12-Gerrit-Database
======================================
Author: Steffen Gebert\
Original URL: https://forum.typo3.org/index.php/t/208068/
Today from ca. 18:20-21:00h CET, Gerrit showed error 500s when
accessing changes because the MySQL service crashed (repeatedly).
Prior to that, the server received a hard reset two times today due to\
data center power outages.
The reason for the problem was that the "changes" MySQL table was
corrupted and InnoDB halts the server once it notices checksum mismatches.
How to fix:\
http://www.percona.com/blog/2008/07/04/recovering-innodb-table-corruption/
- start the server with `innodb_force_recovery=1` so that it is
read-only but will **not** halt on read errors
- dump table
- restart without recovery flag
- drop original table
- import dump
What also helped me to verify which table was corrupted:

    # for i in $(ls /var/lib/mysql/*/*.ibd); do innochecksum -v $i; done
via
[dba.stackexchange.com](http://dba.stackexchange.com/questions/6191/how-do-you-identify-innodb-table-corruption)
h1. Post-Mortem 2015-01-12-Power-Outage
Author: Steffen Gebert
Original URL: https://forum.typo3.org/index.php/t/208069/
Today, several sites were repeatedly unavailable due to a power outage
in a data center where two of our servers are located.
Affected sites include:
- mailing list / news group server
- mailing list T3A
- http://buzz.typo3.org
- http://forger.typo3.org
- http://git.typo3.org
- http://notes.typo3.org
- http://review.typo3.org
- http://svn.typo3.org
- http://wiki.typo3.org
From the information that we received from one of the two sponsors,
there was a defect in the redundant power supply coming from the data
center company. In addition, the diesel generators did not take over
the power supply before the UPS batteries were drained.
We thank our sponsors for their sponsorship and appreciate their
diligent work and support. More info: http://typo3.org/teams/server-team/
Kind regards
Steffen
P.S.: Regarding the longer outage of the Gerrit review system on
https://review.typo3.org, I published a separate post-mortem report:
http://forum.typo3.org/index.php/t/208068/
Post-Mortem 2015-01-12-Power-Outage
===================================
Author: Steffen Gebert\
Original URL: https://forum.typo3.org/index.php/t/208069/
Today, several sites were repeatedly unavailable due to a power outage
in a data center where two of our servers are located.
Affected sites include:
- mailing list / news group server
- mailing list T3A
- http://buzz.typo3.org
- http://forger.typo3.org
- http://git.typo3.org
- http://notes.typo3.org
- http://review.typo3.org
- http://svn.typo3.org
- http://wiki.typo3.org
From the information that we received from one of the two sponsors,
there was a defect in the redundant power supply coming from the data
center company. In addition, the diesel generators did not take over
the power supply before the UPS batteries were drained.
We thank our sponsors for their sponsorship and appreciate their\
diligent work and support. More info:
http://typo3.org/teams/server-team/
Kind regards\
Steffen
P.S.: Regarding the longer outage of the Gerrit review system on
https://review.typo3.org, I published a separate post-mortem report:
http://forum.typo3.org/index.php/t/208068/
h1. Post-Mortem 2015-07-26-TravisCI-Downtime
Author: Steffen Gebert
Original URL: https://notes.typo3.org/p/post-mortem-travisci-2015-07
tl;dr: we enabled replication to GitHub and overloaded TravisCI
TravisCI status report: https://www.traviscistatus.com/incidents/4gbn8hv2sp4m
h2. Setup
We use Gerrit as our code review system: https://review.typo3.org/
We replicate every (proposed) change to https://github.com/typo3-ci/TYPO3.CMS-pre-merge-tests in order to trigger a TravisCI build (see the changes/x/y/z branches).
This provides feedback on whether the tests succeed even before the commit is merged into the official code base (basically giving us the comfort that others have with GitHub pull requests).
h2. Changes on 26.07.2015
* As we replicated only refs/changes/* from Gerrit to refs/heads/changes/* on GH, the status there did not reflect the actual difference between the current state of the target branch and the proposed change (all release branches and master were, simply put, outdated since ~Jun 20th)
* Gerrit was now told to replicate also refs/heads/* and refs/tags/* to GH (around 15:30 GMT+2); see the config sketch after this list
* This caused Gerrit to push numerous commits to GH (around 650 for master branch + xyz for other branches)
* TODO: Insert the missing piece here
* Speculation by AndyG: Gerrit replicates commit by commit, instead of updating the remote branch to the tip of the local branch immediately. To be confirmed
* Speculation by AndreasW: Each version of every patch added to Gerrit since Jun 20th was pushed (that should be more than 650); there are ~580 commits on master from Jun 20th to Jul 27th
* I (Steffen) think this is not the case. We had refs/changes/* replicated all the time. Only refs/{tags,heads}/* was missing, and due to the broken Chef setup, we were not able to deploy that update
* Log output: https://gist.github.com/StephenKing/247ad6e3704d41b97a84
* This somehow filled up the build queue on TravisCI, which caused starvation of other projects' builds (and obviously caused some database issues?)
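For reference, a sketch of how the relevant part of the Gerrit replication plugin's @replication.config@ presumably looked after the change; the remote name and URL details are assumptions, not copied from the actual config:
<pre>
# replication.config -- hypothetical excerpt
[remote "github"]
  url = git@github.com:typo3-ci/TYPO3.CMS-pre-merge-tests.git
  # replicated all along: every proposed change
  push = +refs/changes/*:refs/heads/changes/*
  # added on 26.07.2015: all branches and tags
  push = +refs/heads/*:refs/heads/*
  push = +refs/tags/*:refs/tags/*
</pre>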
h2. Remedy
* Travis refused builds from the TYPO3-ci/TYPO3.CMS-pre-merge-tests repo
* Travis staff contacted us and joined the #typo3-cms channel in slack
h2. Open Questions
* Dan Buch from TravisCI wrote about build requests that are "in the thousands" for that repo. We replicated maybe ~1000 commits to master + release branches.
* Some people, however, also received status mails for commits that are ~4 years old. It is not yet clear why those were built
* Does Gerrit really replicate to remote branches commit by commit?
h2. Conclusions / Next Steps
* We should be careful when replicating a large number of commits to a travis-enabled repo
Post-Mortem 2015-07-26-TravisCI-Downtime
========================================
Author: Steffen Gebert\
Original URL: https://notes.typo3.org/p/post-mortem-travisci-2015-07
tl;dr: we enabled replication to GitHub and overloaded TravisCI
TravisCI status report:
https://www.traviscistatus.com/incidents/4gbn8hv2sp4m
Setup
-----
We use Gerrit as our code review system: https://review.typo3.org/

We replicate every (proposed) change to
https://github.com/typo3-ci/TYPO3.CMS-pre-merge-tests in order to
trigger a TravisCI build (see the changes/x/y/z branches). This
provides feedback on whether the tests succeed even before the commit
is merged into the official code base (basically giving us the comfort
that others have with GitHub pull requests).
Changes on 26.07.2015
---------------------
- As we replicated only `refs/changes/*` from Gerrit to
  `refs/heads/changes/*` on GH, the status there did not reflect the
  actual difference between the current state of the target branch and
  the proposed change (all release branches and master were, simply
  put, outdated since ~Jun 20th)
- Gerrit was now told to replicate also `refs/heads/*` and `refs/tags/*`
  to GH (around 15:30 GMT+2)
- This caused Gerrit to push numerous commits to GH (around 650 for
master branch + xyz for other branches)
- TODO: Insert the missing piece here
- Speculation by AndyG: Gerrit replicates commit by commit, instead of
updating the remote branch to the tip of the local branch
immediately. To be confirmed
- Speculation by AndreasW: Each version of every patch added to Gerrit
  since Jun 20th was pushed (that should be more than 650); there are
  ~580 commits on master from Jun 20th to Jul 27th
- I (Steffen) think this is not the case. We had `refs/changes/*`
  replicated all the time. Only `refs/{tags,heads}/*` was missing, and
  due to the broken Chef setup, we were not able to deploy that update
- Log output: https://gist.github.com/StephenKing/247ad6e3704d41b97a84
- This somehow filled up the build queue on TravisCI, which caused
  starvation of other projects' builds (and obviously caused some
  database issues?)
Remedy
------
- Travis refused builds from the TYPO3-ci/TYPO3.CMS-pre-merge-tests
repo
- Travis staff contacted us and joined the #typo3-cms channel in
  slack
Open Questions
--------------
- Dan Buch from TravisCI wrote about build requests that are "in the
  thousands" for that repo. We replicated maybe ~1000 commits to
  master + release branches.
- Some people, however, also received status mails for commits that are
  ~4 years old. It is not yet clear why those were built
- Does Gerrit really replicate to remote branches commit by commit?
Conclusions / Next Steps
------------------------
- We should be careful when replicating a large number of commits to a
travis-enabled repo
h1. Post-Mortem 2016-11-20 Network Config Changes
Authors: Steffen Gebert, Michael Stucki
h2. Issue Summary
On two occasions during the last days, service availability was impaired after changing @/etc/network/interfaces@ to add our VPN interfaces.
* Saturday, Nov 12th: backup server
* Friday, Nov 16th: physical host server ms06
Both times @service networking restart@ resulted in a permanent loss of connectivity. As we were locked out of the running server, externally triggered reboots were required.
Side note: We do not manually re-configure our servers, but use Chef instead. However, IP address configuration is not part of the Chef setup.
h2. Root Cause
While a syntax error was at least partially the reason in one case, we nevertheless experienced the same connectivity issue when running @service networking restart@ with a correct configuration.
The syntactically correct config file was only accepted after the reboot.
h2. Resolution and Recovery
In the first occurrence, we contacted the organization hosting our backup server, asking for a reboot. In the second case, we had means to execute a remote reset.
h2. Corrective and Preventative Measures
All changes to the network configuration should be backed by an automatic revert procedure that kicks in unless the operator, who remains connected, disables it.
According to "this issue":https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/1301015, @service networking restart@ should not be used. Instead, use
<pre>( ifdown iface; ifup iface ) &</pre>
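The idea behind the subshell with the trailing @&@ is to run both commands as one detached unit, so that @ifup@ still executes even if @ifdown@ tears down the connection your shell depends on.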
However, we are not certain whether this would really be sufficient in all cases.
The following procedure should be automatically triggered to prevent further failures, independent of the way to reset networking:
* After 1 minute: Revert the configuration file change and restart networking
* After 5 minutes: Reboot the server
The following "gist":https://gist.github.com/StephenKing/83fedc56137f5640de929b4430f1b653 can be used, assuming that a backup has been created in @/etc/network/interfaces.bak@.
Usage:
<pre>
curl https://gist.githubusercontent.com/StephenKing/83fedc56137f5640de929b4430f1b653/raw/24a7536bc074b575af55e667ccde0a4f3668fd21/reset.sh > reset.sh
bash reset.sh
</pre>
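For illustration, a minimal sketch of such a fallback (not the actual contents of the gist), assuming the backup lives in @/etc/network/interfaces.bak@; kill the watchdog once you have confirmed that you are still connected:
<pre>
#!/bin/bash
# hypothetical reset watchdog: revert a network config change unless cancelled
(
  sleep 60                                   # after 1 minute ...
  cp /etc/network/interfaces.bak /etc/network/interfaces
  ( ifdown -a; ifup -a ) &                   # ... revert and restart networking
  sleep 240                                  # after 5 minutes in total ...
  reboot                                     # ... reboot as the last resort
) &
echo "watchdog started as PID $! -- kill it if connectivity survived"
</pre>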
Post-Mortem 2016-11-20 Network Config Changes
=============================================
Authors: Steffen Gebert, Michael Stucki
Issue Summary
-------------
On two occasions during the last days, service availability was
impaired after changing `/etc/network/interfaces` to add our VPN
interfaces.
- Saturday, Nov 12th: backup server
- Friday, Nov 16th: physical host server ms06
Both times `service networking restart` resulted in a permanent loss of
connectivity. As we were locked out of the running server, externally
triggered reboots were required.
Side note: We do not manually re-configure our servers, but use Chef
instead. However, IP address configuration is not part of the Chef
setup.
Root Cause
----------
While a syntax error was at least partially the reason in one case, we
nevertheless experienced the same connectivity issue when running
`service networking restart` with a correct configuration.\
The syntactically correct config file was only accepted after the
reboot.
Resolution and Recovery
-----------------------
In the first occurrence, we contacted the organization hosting our
backup server, asking for a reboot. In the second case, we had means to
execute a remote reset.
Corrective and Preventative Measures
------------------------------------
All changes to the network configuration should be backed by an
automatic revert procedure that kicks in unless the operator, who
remains connected, disables it.
According to [this
issue](https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/1301015),
`service networking restart` should not be used. Instead, use
    ( ifdown iface; ifup iface ) &
However, we are not certain whether this would really be sufficient in
all cases.
The following procedure should be automatically triggered to prevent
further failures, independent of the way to reset networking:
- After 1 minute: Revert the configuration file change and restart
networking
- After 5 minutes: Reboot the server
The following
[gist](https://gist.github.com/StephenKing/83fedc56137f5640de929b4430f1b653)
can be used, assuming that a backup has been created in
`/etc/network/interfaces.bak`.\
Usage:
    curl https://gist.githubusercontent.com/StephenKing/83fedc56137f5640de929b4430f1b653/raw/24a7536bc074b575af55e667ccde0a4f3668fd21/reset.sh > reset.sh
    bash reset.sh
h1. Post-Mortem 2016-12-01-SOLR-Search-typo3org
Authors: Steffen Gebert, Michael Stucki
h2. Issue Summary
The SOLR search (including the TER listing) on typo3.org was unavailable, caused by a faulty ACL being deployed.
h2. Timeline
- 13:48 Jochen Roth "mentioned in slack":https://typo3.slack.com/archives/typo3-cms/p1480596515011722 that the TER is unavailable
- 14:37 Michael Stucki noticed this comment and "commented in the #typo3-server-team channel":https://typo3.slack.com/archives/typo3-server-team/p1480599426000787
- 14:51 Steffen Gebert started to investigate this issue
- 14:59 "Corrective fix":https://chef-ci.typo3.org/job/chef-repo/86/ being deployed
- 15:00 Search is available again
h2. Root Cause
- caused by a recent cookbook upload, a new and breaking version of the ohai Chef cookbook (>4.0) was deployed; this changes the way the path to plugins is configured
- as a result, our plugin to fix the IPv6 address detection in OpenVZ was not applied anymore
- as a result, the solr cookbook became unaware of the @ip6address@ of the typo3.org server (by means of Chef search; see the check after this list)
- as a result, the solr cookbook deployed Tomcat ACLs that excluded the IPv6 address of the typo3.org server
- the typo3.org server was unable to contact the SOLR server anymore
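A quick way to verify what Chef search actually returns for that attribute (the node name @typo3org@ is a placeholder, not the real node name):
<pre>
# does Chef search know the node's IPv6 address?
knife search node 'name:typo3org' -a ip6address
</pre>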
h2. Resolution and Recovery
A "change to the chef environment":https://chef-ci.typo3.org/job/chef-repo/86/ now enforces the use of the up-to-date version of the Chef cookbook "t3-openvz":https://github.com/TYPO3-cookbooks/t3-openvz
h2. Corrective and Preventative Measures
- monitoring checks were updated to catch this error. While we were monitoring both the search function and the TER listing, the expected search strings still seem to be included in error responses, so the defect went undetected
- we are about to upgrade our infrastructure to the newer ohai cookbook while still maintaining compatibility with our plugins.
- we will look into Chef's @Policyfile@ feature to see if it helps us both become more confident about which cookbook versions are in use and update platform cookbooks without touching every application stack.
Post-Mortem 2016-12-01-SOLR-Search-typo3org
===========================================
Authors: Steffen Gebert, Michael Stucki
Issue Summary
-------------
The SOLR search (including the TER listing) on typo3.org was
unavailable, caused by a faulty ACL being deployed.
Timeline
--------
- 13:48 Jochen Roth [mentioned in
  slack](https://typo3.slack.com/archives/typo3-cms/p1480596515011722)
  that the TER is unavailable
- 14:37 Michael Stucki noticed this comment and [commented in the
  #typo3-server-team
  channel](https://typo3.slack.com/archives/typo3-server-team/p1480599426000787)
- 14:51 Steffen Gebert started to investigate this issue
- 14:59 [Corrective fix](https://chef-ci.typo3.org/job/chef-repo/86/)
  being deployed
- 15:00 Search is available again
Root Cause
----------
- caused by a recent cookbook upload, a new and breaking version of the
  ohai Chef cookbook (>4.0) was deployed; this changes the way the path
  to plugins is configured
- as a result, our plugin to fix the IPv6 address detection in OpenVZ
  was not applied anymore
- as a result, the solr cookbook became unaware of the `ip6address` of
  the typo3.org server (by means of Chef search)
- as a result, the solr cookbook deployed Tomcat ACLs that excluded
  the IPv6 address of the typo3.org server
- the typo3.org server was unable to contact the SOLR server anymore
Resolution and Recovery
-----------------------
A [change to the chef
environment](https://chef-ci.typo3.org/job/chef-repo/86/) now enforces
the use of the up-to-date version of the Chef cookbook
[t3-openvz](https://github.com/TYPO3-cookbooks/t3-openvz)
Corrective and Preventative Measures
------------------------------------
- monitoring checks were updated to catch this error. While we were
  monitoring both the search function and the TER listing, the expected
  search strings still seem to be included in error responses, so the
  defect went undetected
- we are about to upgrade our infrastructure to the newer ohai cookbook
  while still maintaining compatibility with our plugins.
- we will look into Chef's `Policyfile` feature to see if it helps us
  both become more confident about which cookbook versions are in use
  and update platform cookbooks without touching every application
  stack.
h1. Post-Mortem 2016-12-22-Gerrit-SSH-Hang
Authors: Steffen Gebert
h2. Issue Summary
Connections to Gerrit SSH (port 29418) were hanging (~every third attempt).
h2. Timeline
- Wed, 17:26: Stephan Großberndt reports the problem to the "#typo3-server-team channel":https://typo3.slack.com/archives/typo3-server-team/p1482337598000394
- Thu, 6:59 Starting to investigate this issue
- Thu, 7:45 Gerrit restart resolved the problem
h2. Root Cause
Gerrit did not respond to the client's SSH connection initialization:
!wireshark.png!
The exact reason remains unknown. There are a couple of threads on the Internet about hanging Gerrit SSH connections. Most of them report @database.poolLimit@ as a possible limiting factor. We have set it to @36@, which "should be sufficient".
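For reference, this setting lives in Gerrit's @gerrit.config@; a sketch of the relevant section:
<pre>
[database]
  poolLimit = 36
</pre>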
h2. Resolution and Recovery
@systemctl restart gerrit@ resolved the problem.
h2. Corrective and Preventative Measures
The next time this issue occurs, the following should be checked:
- open tasks: @ssh review.typo3.org -p 29418 gerrit show-queue -w@
- open connections: @ssh review.typo3.org -p 29418 gerrit show-connections@
- maybe change log level using @ssh review.typo3.org -p 29418 gerrit logging@. SSH log file is in @/var/gerrit/review/logs/sshd_log@.
- thread dump: @jstack <pid>@ as @gerrit@ user (you can get the @pid@ from @systemctl status gerrit@); see the sketch after this list
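A sketch for the thread-dump step, assuming Gerrit runs as a systemd service named @gerrit@:
<pre>
# grab Gerrit's main PID from systemd and take a thread dump as the gerrit user
pid=$(systemctl show -p MainPID gerrit | cut -d= -f2)
sudo -u gerrit jstack "$pid" > /tmp/gerrit-threads.txt
</pre>
Then post this information to the gerrit mailing list.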
Post-Mortem 2016-12-22-Gerrit-SSH-Hang
======================================
Authors: Steffen Gebert
Issue Summary
-------------
Connections to Gerrit SSH (port 29418) were hanging (~every third
attempt).
Timeline
--------
- Wed, 17:26: Stephan Großberndt reports the problem to the
  [#typo3-server-team
  channel](https://typo3.slack.com/archives/typo3-server-team/p1482337598000394)
- Thu, 6:59 Starting to investigate this issue
- Thu, 7:45 Gerrit restart resolved the problem
Root Cause
----------
Gerrit did not respond to the client's SSH connection initialization:\
![](wireshark.png)
The exact reason remains unknown. There are a couple of threads on the
Internet about hanging Gerrit SSH connections. Most of them report
`database.poolLimit` as a possible limiting factor. We have set it to
`36`, which "should be sufficient".
Resolution and Recovery
-----------------------
`systemctl restart gerrit` resolved the problem.
Corrective and Preventative Measures
------------------------------------
The next time this issue occurs, the following should be checked:
- open tasks: `ssh review.typo3.org -p 29418 gerrit show-queue -w`
- open connections:
  `ssh review.typo3.org -p 29418 gerrit show-connections`
- maybe change log level using
  `ssh review.typo3.org -p 29418 gerrit logging`. SSH log file is in
  `/var/gerrit/review/logs/sshd_log`.
- thread dump: `jstack <pid>` as `gerrit` user (you can get the `pid`
  from `systemctl status gerrit`)
Then post this information to the gerrit mailing list.
h1. Post-Mortem 2017-01-18: Multiple Services affected by private network flapping
Authors: Steffen Gebert, Andri Steiner
h2. Issue Summary
All services running within the new KVM infrastructure were temporarily affected by issues caused by flapping of internal network connectivity.
h2. Timeline
- 07:13: First monitoring mails indicating issues
- 07:25: Georg Ringer reports in the "#typo3-server-team":https://typo3.slack.com/archives/typo3-server-team/p1484720740000734 channel about connection issues to Gerrit SSH
- 08:07: Issues confirmed by Michael Stucki
- 08:10: Our team is gathering together in the hotel (as we are on a sprint right now)
- 08:30: Good indications for Layer 2 bridging / ARP irregularities on both of the new KVM servers (ms08 and ms09)
- 08:47: "Tweet":https://twitter.com/TYPO3server/status/821624984447680512 informing about service outage
- 09:29: Service availability restored ("tweet":https://twitter.com/TYPO3server/status/821635606040248321)
h2. Root Cause
- Not exactly known
- we have seen incoming packets on the physical server on the @br-int@ interface, which are not forwarded to the @vnetX@ interface of the VM
- we have seen incomplete ARP tables on the VMs (@arp -a@)
- we have seen tons (hundreds per second) of ARP packets, mostly for resolving the gateway address (@tcpdump -i br-int arp@ on the host servers), which are all correctly answered. Many were duplicates.
- @dmesg@ emitted many of the following messages:
> br-int: received packet on t3o with own address as source address
> net_ratelimit: 216 callbacks suppressed
- there was no duplicated MAC address involved
- tinc log files (@/var/log/tinc.t3o.log@) did not include anything helpful
- after shutting down the private VPN (@systemctl stop tinc@) on the physical host, these ARP storms stopped
- starting @tinc@ again restored connectivity to other hosts, without any further ARP storms
The definite cause is unknown.
- One theory is that tinc causes some looping or dropped packets
- Another is that the real issue is the local bridging and tinc was just a side effect
h2. Resolution and Recovery
<pre>systemctl stop tinc; systemctl start tinc</pre>
resolved the problem.
Once the issue occurs again, please **try the following first**:
- remove tinc from the bridge
<pre>
brctl delif br-int t3o
</pre>
to figure out whether it is a bridging or a tinc issue. After checking success, use the above commands to restart tinc and restore a clean state.
- create a tcpdump for all interfaces:
<pre>
tcpdump -i any -w dump.cap
</pre>
h2. Corrective and Preventative Measures
- we plan to get rid of the tinc VPN in the long term, once we have migrated services to the new infrastructure and ordered rack space for all servers, including a dedicated internal network
- if this happens again, we could "issue an @INT@ signal to the tinc process":https://www.tinc-vpn.org/documentation/tincd.8 to temporarily switch to debug logging (or @USR1@ / @USR2@ signals to get connectivity status); see the sketch after this list
- get a better understanding of how to debug such connectivity / bridging issues and of how / when the kernel drops packets, etc.
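A minimal sketch of the signal-based debugging mentioned above (signal semantics as described in the linked tincd man page; the netname @t3o@ is taken from the log path mentioned earlier):
<pre>
# temporarily switch the running tincd to debug logging
kill -INT $(pidof tincd)
# dump connectivity status (connections, nodes, subnets) to the log
kill -USR1 $(pidof tincd)
kill -USR2 $(pidof tincd)
# watch the result
tail -f /var/log/tinc.t3o.log
</pre>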
Post-Mortem 2017-01-18: Multiple Services affected by private network flapping
==============================================================================
Authors: Steffen Gebert, Andri Steiner
Issue Summary
-------------