System Status

Recently Fixed Issues

Australian Server Issues
08:00AM to 09:45AM Thurs Feb 24 2011
A networking issue between Zeald and our Australian servers meant .com.au sites failed to load, or did so intermittently.

Outage on the Orcon cloud affecting some Zeald websites
04:55AM to 10:14AM Wed Jan 26 2011
Websites hosted on the Orcon cloud, including many of Zeald's clients and also some of our own websites such as http://help.zeald.com, were offline due to the failure of the core fileserver in our server provider's cloud hosting network.
Update 10:14AM: The underlying issue (a failed fileserver in our provider's cloud hosting infrastructure) has been fixed. Most websites are back online, although some are still restarting (they will be back online in a few minutes).

Outage on the Orcon cloud affecting some Zeald websites
9:30AM (estimated) to 1:25PM Sun Jan 23 2011
Services affected: Website hosting for some customers
Websites hosted on the Orcon cloud, including many of Zeald's clients and also some of our own websites such as http://help.zeald.com, were offline due to the failure of the core fileserver in Orcon's cloud hosting network.
Orcon technicians were advised of the issue at 10:00AM and resolved the problem at approximately 12:35PM. However, subsequent testing showed that for approximately half our servers the fix was not sufficient, and this issue was fixed at approximately 1:15PM. At that time we needed to reboot all the cloud servers for the fix to take effect, which took approximately 15 minutes to complete.

Slow service on Orcon cloud cluster
9.00-15.00 Thursday 20/1/11
The Orcon cloud cluster servers experienced slow loading times due to packet loss. The issue is with our hosting provider. We have advised their engineers of this issue, and have been told that the problem is being looked at.

Downtime on Orcon cloud cluster
19.30-20.30 Thursday 16/12/10
The Orcon cloud cluster servers lost internet connectivity. The issue was with our hosting provider.
We have advised their engineers of this issue.

Issues with popups in the rich text editor, file uploads, and selecting products and categories in coupons and elsewhere
Sat - Monday 6/12/10
A change made by Olark, who provide the live chat software we use for live chat support in the website manager, broke the popups used in the image manager in the rich text editor, image uploads in the item editor, product and category selectors in the coupon editor, and several other popups. We've disabled live chat support while we find a workaround to this issue: we anticipate it will be enabled again soon.

Website outage
Sunday 7 November 2010 - 7:04 - 9:11
Services affected: NZ hosted websites
All four of our NZ clusters' webservers crashed simultaneously on Sunday morning and locked up until they could be restarted. This seems to be caused by a bug in the webserver software we use: we cannot discern any other underlying cause. We are currently load testing a test server to see if we can trigger a similar event in order to isolate further what may have happened.

help.zeald.com and domains.zeald.com inaccessible to TelstraClear customers
Tuesday 2 November 2010 - Wednesday 3 November 2010 9:35 AM
Services affected: domains.zeald.com, help.zeald.com
Customers on Paradise (one of TelstraClear's 3 ISP divisions) were unable to access the domain names help.zeald.com and domains.zeald.com, due to a bug in the Paradise DNS servers that caused them to reject a (valid) configuration change in our DNS server. We have reverted the configuration on our end, and these domain names now resolve correctly for Paradise's customers.

Outage for some NZ hosted websites and email
Tuesday 2 November 2010 12:53 - 1:09 PM
Services affected: Email, Website Hosting for approx 3% of websites
One fileserver used for our NZ-based hosting cluster crashed and needed to be rebooted.
This affected the approximately 3% of websites hosted off this server, and also interrupted mail checking and delivery. Mail sent to you during this time will still be delivered but may be delayed temporarily.

Overnight outage on incoming customer mail service
Wed 20/10/10
Due to a network outage, the primary mail server was inaccessible in the early morning of Wed 20th. This affected the domain manager and webmail service. While customers could not log in to check their mail for the period of this outage, mail was redelivered to a secondary mail server, so no mail was lost. We restored the webmail and domain manager service in the morning.

Intermittent outage on Wellington-based hosting servers
Monday 11/10/10 10:00 - 10:05
An outage of an internal DNS server caused some websites to fail to load (returning 500 internal server error). This has been rectified by rolling back the configuration of this server.

Power outage in Northcote Datacentre causes major disruption to all Zeald services
Tuesday 05/10/10 01:00 - 02:30, 11:00-12:30, 1:30 - 3:45
A major Vector power outage in Auckland caused power problems in the Northcote Datacentre (operated by Orcon) that Zeald uses to host our core servers. This happened despite the considerable safeguards (battery and generator backup, surge protection etc) that are in place to stop this kind of issue. This outage meant that three times during Tuesday 5th October all Zeald servers were powered off for extended periods of time, which caused a number of problems:
Outage on Wellington-based hosting servers
Friday 1/10/10 15:06 - 15:36
A fibre outage in Wellington with our upstream provider Orcon blocked access to all of Zeald's cloud-hosted websites (approx 60% of websites).

Outage on Auckland-based hosting servers
Thursday 30/9/10 02:55 - 08:15
A server configuration change to our Auckland-based hosting cluster (approximately 40% of NZ-hosted sites; the remainder are hosted in Wellington) led to these servers crashing when they tried to run overnight maintenance tasks around 3AM. This configuration change has been rolled back pending further testing.

Overnight crash on zes15 cluster
Sunday 19/9/10 21:52 - Monday 20/9/10 08:54:01
At 9:52 PM excessive web hits on this cluster caused both servers in it to run out of memory and lock up. They remained locked up until they were manually rebooted at 8:54AM Monday. This mode of failure is unusual: generally there are safeguards in place to ensure that if a server is overloaded it will disable itself and not crash in this way. However, possibly due to the speed of the load spike and the small size of the cluster in question, these measures did not have time to take effect. We are still analysing logs to see if there is any way to avoid such problems in future.

Email Migration
Wednesday 15/9/10 04:50 - 11:00
This morning we switched all of our customers on our iserve email hosting (generally speaking, those that registered/transferred their domains over a year ago) over to a new mail server infrastructure. More info about this is available in our "What's new" blog.
This has gone well: if you are checking your email via pop3.zeald.com then you are using the new mailserver. We don't anticipate it causing any problems for most clients.

Known issues

Historical email not there (yet)
We are currently in the process of bringing over historical email in users' mailboxes (this process will continue until approximately 10AM).
All your new mail is being correctly delivered and you won't lose anything: this is mostly applicable to users who use webmail.

Webmail address has changed
You should use http://webmail.zeald.com to check your webmail.
If you get a username/password error: make sure your incoming POP server is set to pop3.zeald.com.
Instructions for many popular mail clients are in our online help.

Old email reappearing in your inbox
Some users who have their mail client configured to check but not delete email in their mailboxes may see these messages again: we apologise for any inconvenience this may cause, it's a one-off event that cannot be avoided. We don't want anyone to lose any mail, so we have to copy any mail you might have in your mailbox to the new server: however, Outlook and other mail clients will now probably think it's a new message and redownload it. You can safely ignore/delete these messages.

Users who still have their mail client configured to check an expired/deleted email address may have got errors
Unfortunately it didn't occur to us that many clients who have cancelled their domains or changed or deleted email addresses may still be checking these (deleted) email addresses: in the past, while no mail was delivered to them, no errors would have resulted from this. On the new system, checking these deleted mailboxes results in a "user not found" error. We have since resolved this (by creating fake accounts for all these mailboxes), so you can safely ignore these errors, which should have disappeared by now, although it would probably be better if you were to delete these defunct accounts.

Bounced e-mail marketing e-mails on cloud cluster
31/8/10 12:40 - 14:20 6/9/10
Due to a misconfigured relay mail server, e-mail marketing e-mails from sites hosted on cloud were bounced with "relay not permitted". We have fixed the mail server in question as well as unblocked affected e-mail addresses. This fault did not affect the order and enquiry e-mails.
Cloud clusters down
4/9/10 07:40 - 08:25
All websites hosted on cloud servers were inaccessible for approximately 45 minutes this morning. There was a fault in connectivity to the specific datacentre (located in Wellington) that these servers are hosted in (possibly this is related to flow-on effects from the Christchurch earthquake this morning). This fault has been resolved by datacentre staff.

Authentication Errors for API users
2/9/10 17:30 (variable) - 3/9/10 11:05
A broken webserver configuration caused API users using HTTP authentication to get authentication errors, with the web service returning 401 Unauthorised and failing to proceed. The outage window for this error varies by site, depending on whether your webserver was reconfigured during that timeframe - we estimate approximately 20% of API users would have been affected.

Slow page load on cloud clusters
1/8/10 11:30 - 14:30
One of our weekly maintenance scripts caused high load on the NFS server. This affected the loading time of web pages served from cloud clusters. The issue has been rectified.

Intermittent outage on cluster 11, NZ Hosting (cloud clusters)
7/7/10 12:00 - 12:30
One of the servers in the cluster crashed, which caused an intermittent outage (approximately every fourth request). The server has been rebooted, which resolved the issue.

Outage on NZ Hosting (cloud clusters)
9/6/10 04:30 - 08:30
A hardware issue (the tendency of a specific server to drop networking intermittently) caused the VPN server to drop its link to the cloud infrastructure. The VPN server has been moved to different hardware altogether to resolve this issue.

Outage on NZ Hosting
9/6/10 Intermittent outage between 3:20-3:57PM NZST for approx 30% of NZ hosted sites
Failure of the primary DNS server for a number of servers caused an intermittent outage/degraded performance for approximately 30% of Zeald sites.

Outage on NZ Hosting
6/6/10 07:26 - 08:11
A hard drive failure in a server caused an outage for NZ hosted sites.
It seems that while the hard drive failure itself didn't cause problems (our system is designed to compensate for such failures), it coincided with a load spike (an Eastern European spam spider; such things crawl all public websites frequently) which, with the reduced capacity caused by the loss of the failed server, caused a system-wide outage in several core services. The failed hard drive has been disabled and we believe it should cause no more problems.

Database server overloaded
22/5/10 8:00-9:50
A server failure combined with higher than usual load for a Saturday resulted in a database server being overloaded. This manifests as websites loading very slowly, and in some cases timing out completely. The affected server has been rebooted and we are monitoring the situation closely through the weekend.

Network outage in one of the datacentres
11 May 2010 15.15 - 15:40
There was a network hardware failure in one of the datacentres. The outage affected some NZ websites. The issue was resolved by datacentre engineers promptly. The issue also caused an e-mail outage. The e-mail services have been restored. For more information please read here: http://iserve.net.nz/announcements.php

Database server outage on zes14 cluster (affects approx 80 sites)
8 May 2010 5:10AM - 9:30AM
The database server for this cluster ran out of disk space - this then triggered a previously unseen bug in the MySQL database server software, whereby all user accounts were lost and had to be restored from backup before this hosting cluster would work again.

Broken links on some sites
30 April 02.00 - 9.35
A product release to fix a bug in URL redirect handling unfortunately caused another (worse!) bug, which meant that many normal non-redirected page links were broken. This has now been fixed; a server restart is required to deploy the change and this may result in a 30s - 1min outage for some sites.
10 minute outage affecting some customers
20 April 18.20 - 18.30
A short network outage between datacentres caused some of the sites to be inaccessible for up to 10 min. The issue has been resolved. The issue is being brought up with datacentre staff for further investigation.

Database replication delay
14 April 09.40 - 10.00
A product release caused a delay of up to 10 minutes in database replication for a period of approximately 20 minutes. This manifested in the back end as changes appearing not to save, but the website front ends were unaffected.

Websites down on cluster10
12 April 16.30 - 18.00
Sites on cluster10 went down for 90 min at 16.30 on 12th of April. The cause of this issue was an out-of-disk-space condition caused by one of the customers uploading an extremely large number of image files. As we were resizing the volume the file system became corrupted, thus prolonging the downtime. This issue has now been resolved and websites should be functioning normally.

Spike in Visitor stats
30 March
In the last month Google and Yahoo changed how they identify themselves to our servers, resulting in them no longer being excluded from the website stats. This issue is fixed now, but some sites may still be seeing a spike in stats over that time. We are working on regenerating the stats for these dates to exclude the search engine visits.

10min loss of International Connectivity
30 March 2010 (from 9:42AM NZDT - 09:53AM NZDT)
Services Affected: International connections to NZ hosted websites
A loss of international connectivity by our upstream provider resulted in all NZ hosted websites being inaccessible to international visitors for 10 minutes this morning.

Advanced pricing issues for some websites
18 March 2010 (from approx 10:30AM NZDT - 1.30pm)
Services Affected: Advanced pricing on some websites
A product release, base-3.7.3.188, intended to fix pricing errors on some websites with multicurrency, instead caused pricing errors with Advanced Pricing.
This error would manifest itself as the advanced pricing being ignored and the base price for the product used instead. This release was rolled back within approximately 15 minutes of the first report of issues with pricing. However, websites that had been reconfigured during this time (the main way this would happen is by editing shipping or some preferences in the website admin) kept the old pricing routines (and thus the bug) until this issue was discovered at approximately 1:30 - a full restart of all servers has now been done to resolve the issue fully. Release 3.7.3.188 is currently in testing to try to determine the underlying cause of the problem and why it was not caught by our automated regression testing before being deployed.

Problems with pop3 authentication for email
22 February 2010 (from approx 3PM NZDT - 5.30pm)
Services Affected: Email for some websites
Due to an issue with a database server at iserve, authentication for pop3 email is occasionally failing, meaning that customers are asked to enter their username/password again as though it were incorrect. This problem is intermittent, so if you keep trying you will eventually get in. Mail is being queued and no email has been lost.
http://www.iserve.co.nz/announcements.php
22 February - Iserve email hosting down for a few minutes in the morning

Downtime - NZ hosting
20 February 2010 (from approx 3:30PM NZDT - approx 9PM NZDT)
Services Affected: Approx 5% of NZ hosted websites
Status: Ongoing
An outage of Orcon's cloud hosting infrastructure in Wellington caused an outage of the websites hosted on it.

Downtime - NZ hosting
22 January 2009 (from 2:48PM - 3:01 PM NZDT)
Services Affected: Approx 20% of NZ hosted websites
Status: Fixed
Hardware failure on a fileserver caused an outage on this hosting network while the server was rebooted.
Loss of the primary fileserver caused the secondary redundant backup fileserver to crash and reboot as well (it seems to be a related hardware failure, caused by the increased load from the first failure). We are in the process of migrating websites off these two servers so we can temporarily decommission them until we can isolate the issue.
Update 26/1/2009: The unusual circumstances of this double failure seem to have caused the order counters on the affected sites to not increment for orders made during this time period - this has various unpleasant consequences, the major one being orders potentially overwriting each other. To resolve this, a script is currently being developed to identify where this may have happened and resolve it by restoring data from logs (please note that very few sites are likely to actually be affected due to the short period of time involved).

Downtime - NZ hosting
18 December 2009 (from 01.42 - 08.45 NZDT)
Services Affected: Initially cluster4, later all NZ clusters, as well as some mail services.
Status: Fixed
A combination of hardware failure (a disk controller causing a kernel panic on one of the servers at 01.42 NZDT) and software malfunction (the high-availability system failing to fail over to the running machine) caused overload on the Frontends, which resulted in the downtime. All clusters apart from cluster4 were brought back up at 8.45, while cluster4 was brought back up at a later time. Configuration on the Frontends has been adjusted for this scenario, and we are investigating the cause of the hardware and software failures described above.

US Cluster was temporarily inaccessible
7 December 2009 (from 13.00NZDT - ongoing)
Services Affected: US hosted websites are inaccessible
Status: Resolved
A major network gear upgrade is in progress in the datacentre that hosts the affected servers. The outage consists of multiple 1-5min outages (one for each network device that is installed). The engineers estimate that this will be over by 14.00 NZDT.
DOS-related outage on some NZ sites
2 December 2009 (from 3:50 - 4:15PM (estimated))
Services Affected: Reduced performance and then outage for approx 10% of NZ hosted websites
Status: Resolved
High load caused by a denial of service attack triggered a hardware failure on a primary NFS server for this hosting cluster. Although the system correctly failed over to the secondary as expected, the unexpected degraded performance that followed caused a load spike that led to cascading failures in other systems. To resolve this, a full reboot was required, causing an outage to this hosting cluster until the NFS server had restarted.

Intermittent error on zes5 hosting cluster
29 November 2009 8:20 PM - 30 November 2009 9:15AM
Services Affected: Intermittent "Catalog not found" error for approx 10% of NZ hosted websites
Status: Resolved
Due to a denial of service attack on one of our NZ hosted sites, one server was blocked from the database server. This resulted in errors whenever this server was used to service a request. The configuration on the database server has been changed to avoid this problem recurring.

High load, slow performance on NZ Hosting
20 November 2009 (from 12:15PM NZDT - 1PM NZDT)
Services Affected: Reduced performance for approx 10% of NZ hosted websites
Status: Resolved
A popular website is receiving extremely high traffic due to a (highly effective!) Christmas promotion - this is slowing down performance for the entire hosting cluster it is on.
Update 12:57PM: We are in the process of bringing spare capacity online to improve this, and carefully managing traffic to minimise the disruption this causes. We anticipate performance will improve over the next 15 minutes as these spare servers come online, but we will continue to manage the load carefully and further performance problems may occur.
Intermittent load issues on NZ hosting
13 November 2009 (from 10:22PM NZDT - 12:58AM NZDT)
Services Affected: approx 10% of NZ hosted websites
Status: Resolved
Flaws in a scheduled task running on this cluster caused excessive load while generating website statistics. This caused poor performance (and at various times, websites failed to load). This has been resolved by updating the database in question.

Intermittent silence issue with Zeald office phone line
22 October (from ~9.00AM NZDT - 11:00AM NZDT)
Services Affected: Zeald main phone lines
Status: Resolved
When the Zeald main phone line is called, the caller gets intermittent silence instead of a greeting. We have reported this problem to our VoIP provider, but at this stage we haven't been given an ETA. This issue affects their whole network. In the meantime, if you have an urgent matter you can contact us via e-mail (support@zeald.com). Alternatively you could try again, as the problem is intermittent.
UPDATE: We have been contacted by our VoIP provider and notified that the problem has been fixed.

Email, DNS outage for iserve-hosted customers
14 October (from 7:43AM NZDT - 11:02AM NZDT)
Services Affected: Email and DNS hosted on iserve. Websites reliant on iserve DNS are also inaccessible.
Status: Resolved
Iserve, who provide email and DNS servers to many of our customers, are having a major network outage. The iserve network is inaccessible from most major New Zealand ISPs, meaning email and website requests will not succeed. More up-to-date information about this outage may be available on iserve's status page.
Update 9:20 AM NZDT: Iserve advise that this is an outage in an upstream provider that should be resolved within approximately 1 hour.
Update 9:50 AM NZDT: The iserve network appears to be reachable now. It may not yet be reachable from all ISPs.
The update from iserve is: "Our upstream providers have advised that they have isolated the issue and are currently working to resolve a hardware issue. We will post further updates as information comes to hand"
Update 11:02AM NZDT: Iserve advise that this problem is resolved.
Update 10.30AM 15/10 NZDT: There have been caching issues with ISPs/customers, which have now been resolved.

Downtime on Australian cluster
Tue 8th September (from 17.00 NZDT to 18.00 NZDT)
Services Affected: Australian sites that are pointed to the old IP address (those who are affected: please point the domain for your Australian site to the new IP address 119.148.66.58).
Status: Fixed
There is a networking issue in the Australian data centre. This issue is beyond our control, but we have notified technicians from that data centre. It looks like only two sites are affected. Those affected sites still use the old IP address.

Slow service, 403 Forbidden on inquiry pages
Mon 7th September (from 9.30 to 16.00 NZDT)
Services Affected: NZ Cluster
Status: Fixed
We have been subjected to a DDoS attack. From our investigation, the attack was targeting inquiry forms with the purpose of exploiting them to send spam. The exploit was unsuccessful, but created a lot of traffic. Due to the distributed nature of this attack it was very difficult to differentiate the attacker (an exploited network of computers all over the world) from legitimate users, so the temporary measure was to deny access to IE6 users, as the attacking machines were posing as plain IE6 (an obsolete browser by today's standards). This restriction appeared to IE6 users as a 403 Forbidden error when they tried to place an inquiry. The restriction itself lasted approximately one hour, until we pinpointed the difference that allowed us to craft a better measure.
From a security point of view the attack was unsuccessful, but from a service point of view it unfortunately caused extreme load on our servers. We have tightened security around enquiry forms, which could show up as a 403 error if the enquiry form is used abnormally or abused.

Downtime on US cluster
Mon 31st August (from 16.15 to 17.00 NZDT)
Services Affected: US cluster.
Status: Fixed
Due to a huge unexpected traffic spike, the US server ran out of resources and was temporarily offline. We have allocated more resources to deal with higher load.

Downtime on US cluster
Mon 27th July (from 10.30 to 10.45)
Services Affected: US cluster.
Status: Fixed
This was a scheduled 15 min downtime by Dallas data centre staff. The core network equipment has been upgraded.

Downtime on Cluster 3 (NZ)
Mon 6th July (from 10.30 to 11.15)
Services Affected: Cluster 3 (167 sites).
Status: Fixed
Due to scheduled maintenance (involving a reboot) of a server in that cluster, we had an unforeseen load spike on the rest of the servers, bringing them down as well. It took under 45 min for the load to settle down, causing downtime of between 20 and 45 min for some sites.

Website mail delayed
Monday 22nd June (11.30am) Duration: under 3hr
Status: Fixed
The website and e-mail marketing mail servers have been upgraded to a newer version; due to a change in the structure of the configuration files, some of the mail was not delivered until later in the day. The offending misconfiguration has been fixed.

Intermittent broken images on zesuk-1 sites
Friday 29th (5.30pm) Duration: intermittent broken images until 12.00 2nd June
Services Affected: 2 sites.
Status: Fixed
Ahead of the decommissioning of the old UK server, we moved the 2 remaining sites to new servers on Friday afternoon, setting up a proxy between the old and new servers (until the domain names are pointed to the new IP). Unfortunately we missed the firewall rule that rate-limits connections from a single IP (when we tested it there was not enough traffic to trip it).
This was fixed promptly upon discovery.

Downtime on NZ Cluster
Wed. May 13th (from 11.30) Duration: intermittent server errors until 13.00 next day
Services Affected: Legacy cluster - 8 sites, Cluster 4 (190 sites).
Status: Fixed
Due to scheduled power upgrades, which were part of our ongoing improvement and future-proofing of our infrastructure, one of the servers in Cluster 4 (whose uptime was over 560 days) was shut down. Unfortunately something went wrong with the file system and we lost one of the databases, which caused intermittent internal server errors on Cluster 4 due to extremely high load (resulting from reduced capacity). The same issue was responsible for losing the file server on the legacy cluster. We have rebuilt the database server and restored connectivity to the legacy cluster file server.

Downtime on NZ Cluster
Wed. May 6th (from 11.30) Duration: intermittent server errors until 16.00
Services Affected: Whole NZ cluster
Status: Fixed
Due to scheduled power upgrades in the datacentre, approximately 50% of the servers were shut down and started up sequentially. Unfortunately this did not go as smoothly as we hoped, as 50% of the capacity was not enough and the load crashed the servers that were still serving. We have postponed the continuation of the upgrades until next week.

Reboot on US server
Wed. April 8th (15.20) approximately 2 min. duration
Services Affected: US cluster, 9 sites were affected.
Status: Fixed
A server resource upgrade required a reboot. This upgrade will ensure that the server performs normally during load spikes.

Downtime on Australian cluster
Tuesday April 7th (17.30) approximately 10 min. duration
Services Affected: Australian cluster, 14 sites were affected.
Status: Fixed
A routing issue in the datacentre where the server is hosted caused the range of IP addresses that covers our Australian host to be unroutable. Technicians at the datacentre fixed the issue.

Downtime on Australian cluster
Monday April 6th (20.30) approximately 30 min. duration
Services Affected: Australian cluster, 14 sites were affected.
Status: Fixed
Very high load created by three spider bots indexing at the same time (Google, MSN and Yahoo) left the server unable to serve content. A restart of the server fixed the after-effects.

Downtime on all New Zealand clusters
Saturday April 4th (14.00) approximately 1 hour duration
Services Affected: most New Zealand sites.
Status: Fixed
A runaway subversion process consumed all memory (due to a non-hosting server being down). This caused high load on all servers, as every server was running that process simultaneously. The issue was resolved promptly as the services are under intensive monitoring.

Downtime on Australian cluster
Friday April 3rd (14.00) approximately 30 min. duration
Services Affected: Australian cluster, 1 site was affected.
Status: Fixed
The zesau-2 host was brought down for emergency maintenance because it was not responding normally.

Downtime on Australian cluster
Saturday March 21 (16.30) - Sunday March 22 (04.30)
Services Affected: Australian cluster, 14 sites were affected.
Status: Fixed
A high load spike caused the server to run out of memory, resulting in downtime. The ultimate resolution of the issue is in the pipeline: a new dedicated server has been built and is ready for sites to be moved to.

Undefined catalog error on US cluster
Saturday March 7 (0.30) - Sunday March 8 (00.30)
Services Affected: US cluster, 7 sites were affected.
Status: Fixed
The reported Undefined catalog errors were caused by a crashed database (a runaway process used up all the memory, making it impossible to start new processes). The duration of the downtime was approximately 12 hours. Monitoring has been improved to detect such failures in future.
NFS lost configuration on the legacy cluster
Tuesday Feb 24 (9pm) - Wednesday Feb 25 (9.30am)
Services Affected: 8 websites (0.7% of NZ hosted websites)
Status: Fixed
A glitch in the fail-over system on our legacy cluster (current clusters were not affected by this problem) caused it to fail over to a faulty configuration. Due to the nature of this failure the monitoring system did not pick up the fault, as the actual servers were running fine. The failure resulted in 403 Forbidden errors.

Server restart
Monday Feb 9, between 3:00 - 3:20PM
Services Affected: Approx 1-5 minutes downtime on 14% of NZ hosted websites
Status: Fixed
Serious performance problems, caused by an issue with one of our servers, necessitated a server restart for all websites to apply a settings change. This resulted in approx 1-5 minutes of downtime for affected sites.

Archived/Historical Issues

High Christmas Load
Monday Dec 8
Services Affected: Website hosting, email marketing
Status: Fixed
Update 15/12/2008 - 19/12/2008: The addition of four extra servers to our hosting infrastructure appears to have resolved this issue this week. We are monitoring the situation as there is still the possibility that huge traffic spikes on any one site will adversely affect performance - but since there is now a significant over-provisioning of server resource we are much better able to handle this scenario. We do not anticipate further performance issues in the website frontend.
Update 10/12/2008: Various improvements we have made have helped with this - most websites are operating at normal performance levels. However, to help further we are currently building and purchasing four additional "emergency servers" to cover this load (we have limited options for server hardware that can be delivered this side of Christmas). We hope to have this extra resource live on Dec 10/11 depending on delivery schedules.
Previous information about this outage
Christmas 2008 has been the most successful ever for our customers, especially for e-commerce customers running on the Zeald platform. Unfortunately this high load is causing intermittent problems, especially when any one site has extreme load spikes due to highly successful marketing. We are monitoring our servers very carefully in order to allocate server resources exactly where needed; however, in some cases our systems are running much more slowly than we would like, especially in the backend administration of sites. We are trying to prioritise front-end customers' concerns over backend users to ensure that orders and enquiries keep flowing. The following services are experiencing the most issues:
Website outage
Mon Dec 8 2008 8:30AM - 9AM
Services Affected: Some websites
Status: Fixed
A sharp load spike on Monday morning resulted in serious performance degradation for some sites, until servers could be migrated to this cluster from elsewhere.

Slow email delivery for email marketing emails
Mon Dec 1 2008
Services Affected: Email marketing mail delivery
Status: Fixed
On December 1, due to an unexpectedly large volume of mail being sent as part of 2008 Christmas marketing, email marketing mail took longer than usual to be delivered. This has now been resolved on our end by the addition of extra server resource to our email infrastructure. Here are the statistics for how many messages were delivered within specific timeframes on this date:
Under 1m 23.1%
Outage on NZ hosting cluster
Mon Dec 1 2008 15:00 - 17:00
Services Affected: Slow/non-responsive hosting for approx 20% of NZ hosted websites
Status: Fixed
Unexpectedly high Christmas load on one of our server clusters caused the hosting and database servers to crash. This problem has now been resolved by the addition of more server resource to that hosting cluster. Unfortunately several of the websites with the highest load happened to be on the same website cluster. This, combined with an inordinate number of website promotions (associated both with the beginning of December and the beginning of the week), led to a massive load spike on this cluster (approximately four times as many people visited today as on other days). To resolve this we have added three additional servers and rebalanced the load by moving some sites to different clusters, and we are monitoring the situation to see whether additional resource will be required in the lead-up to Christmas.
Outage on NZ hosting cluster
Wed Nov 19 2008 10:00 - 11:30
Services Affected: Slow/non-responsive hosting for NZ hosted websites
Status: Fixed
A script released as part of an update to handle higher than usual email load over the Christmas season malfunctioned and consumed all memory on a significant number of web servers before the problem could be diagnosed and fixed. Although this particular issue was fixed fairly promptly, approx. 75% of our server capacity needed to be rebooted in order to free up memory, which meant that the remaining servers were brought down by the excessive load this generated. As server capacity progressively came back online this problem was slowly resolved after approx 11:00; however, system performance was degraded for up to 30 minutes after this time.

NFS server failure on NZ server cluster
Fri Sep 25 2008 09:10 - 09:20
Services Affected: Approx. 1/6th of websites on the New Zealand hosting cluster
Status: Fixed
Overnight filesystem corruption errors, caused by a failed hard drive, forced one of our cluster's NFS servers to switch to read-only mode, causing intermittent problems for the websites on that cluster. To resolve this issue we were forced to shut this server down and run a filesystem check, which resulted in a 10 minute outage for all the websites affected.
Slow international connections to NZ-based websites
Wed Sep 25 2008
Status: Fixed
We are experiencing slow international connections from and to our NZ-based website hosting servers. We are investigating this issue with our upstream bandwidth providers and will update this notice as more information is found.
Update 29/9/2008
This problem is still continuing, and it is still unclear what is causing it. There seems to be a very high error rate on international connections, but our collocation provider is unsure as to the cause. We are hopeful that a planned configuration change on their router overnight will fix it. In the meantime we have moved the routing of email to a different connection, which helps reduce the effect of the problem by freeing up bandwidth on our international connection.
Temporary Fix! Update 29/9/2008
After spending time testing this issue with our provider's engineers, they have managed to find a workaround: increasing our upstream bandwidth limit seems to work around the problem. This probably hasn't solved the underlying issue (as we were nowhere near saturating our international link, increasing the limit should make no difference), but in our testing from offshore locations web traffic is now 100-200 times faster than it has been over the last few days. With this temporary fix we can work to isolate this issue without the problem affecting customers.

Orders incorrectly logged in website database
Wed Sep 24 2008 AM
Status: Fixed
This morning a database update broke the logging of orders to the website database. This problem has since been fixed; however, for the orders that occurred during the outage:
These orders have now been restored to the website databases of the affected websites.

Slow international connections
Updated Wed Sep 17 2008
Status: Fixed
Over the last several days we have been experiencing major performance degradation on our international link. This has been causing a number of problems:
Email Delay
Between approx 4PM 26/8/2008 and 10 AM 27/8/2008 NZDT
Status: Fixed
Services affected: Email sent from websites within our New Zealand hosting infrastructure
Mail was queued for a period of 18 hours overnight. No mail was lost; however, mail was in some cases delayed. As at 10AM August 27 mail is sending normally. There is, however, a significant queue of outgoing mail, and depending on the rate at which your incoming mail server configuration allows mail to be delivered, it may take some time for this backlog to be delivered. This was caused by the disk on one of our mail servers filling up with logs. That issue was resolved quickly but led to the mail server crashing. Outgoing mail on the web servers was then queued until the mail server began accepting mail again at approximately 10AM 27/8/2008.

Two five-minute outages for websites on one server cluster
Between 9:30AM and 10 AM 12/8/2008 NZDT
Status: Fixed
Services affected: Approx 20% of NZ hosted websites for less than 10 minutes total
Possibly as a result of the high load the previous day, one of our secondary database servers experienced corrupted tables overnight at approximately 3AM (a very unusual scenario we have never seen before, probably caused by a bug in the database server software itself). The system then correctly and automatically removed this server from the cluster, falling back to using just one database server. Note that as all data is replicated on at least two separate servers within our hosting infrastructure, this scenario does not result in data loss for our clients, and the system is capable of running with only one database server under normal load without users of the websites even noticing. The data on this failed database server was not recoverable, so in the morning we needed to copy the data from a snapshot of the other database server in order to bring this server online.
As this is the same cluster that was affected by the load spike the previous day, a decision was made to do this immediately rather than waiting until outside of working hours: running on just one database server, the load experienced the previous day during peak hours would have crashed the system and necessitated several hours of downtime. This involved shutting down the master database server for long enough to take a snapshot of the database and then starting it up again, which involved an outage of a few minutes. The first time this was attempted, however, the server itself crashed (a kernel bug?), necessitating a 5 minute server restart. After the server was restarted the process needed to be done again (the second time the process completed correctly), causing another outage of a few minutes.

Very slow/intermittently inaccessible websites on one server cluster
Intermittently 2-2:30PM 11/8/2008 NZDT
Status: Fixed
Services affected: Approx 20% of NZ hosted websites
One website on our NZ-based hosting servers experienced a very high load spike (some 20,000% higher than normal load!) due to a (very successful!) online marketing campaign. This load spike exceeded the usual capacity of our system to adjust to varying system load and caused a number of servers to crash. Once this issue was identified, extra server resource was brought online within that server cluster.
These servers take approximately 15 minutes to boot and be configured. Once this completed it quickly fixed the immediate issue (websites failing to load); however, websites on this cluster may have experienced degraded performance throughout the remainder of the afternoon.

Intermittent website outages on UK-based server
Intermittently over the course of 6/8/2008 NZDT
Status: Fixed
Services affected: Website hosting on our UK-based server
On the UK-based server, an issue was discovered where too much load is generated on the database server when users of a high-volume website view their website traffic reports. This causes the websites hosted on that server to either run very slowly or to time out and fail to show at all. As a temporary workaround we have disabled website reporting on this server, which will stop this from happening. We hope to fix the bug that causes the underlying issue soon and re-enable access to the website reports.

Server outage on NZ-based hosting
Wed July 30, 12:00 - 12:13
Status: Fixed
Services affected: Hosting on our New Zealand-based hosting infrastructure
A reboot of a server within our New Zealand hosting cluster led to an IP address conflict that resulted in websites returning a "404 Not Found" error. Shutting down this IP address and then resetting the switch were required to fix this issue, which was completed by 12:13.

Partial server outage for US-based hosting server
Friday 25 July, 9:40 PM - Saturday 26 July 11:32 AM
Status: Fixed
Services affected: Website hosting on our Dallas-based server
Over the weekend our USA-based server ran into an issue (caused by a memory leak in a piece of system software) where it no longer had sufficient memory to serve websites. Unfortunately our server monitoring system was unable to detect this type of error (as the HTTP service was still "up"; however, websites themselves were either running very slowly or not at all).
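The gap described here (a service that answers on HTTP while pages are slow or broken) can be illustrated with a content-aware health check. The following is only a minimal sketch and not the monitoring system actually used: the names `ProbeResult`, `page_is_healthy` and `probe`, the marker text, and the 10-second threshold are all hypothetical choices made for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    status: int             # HTTP status code, or 0 if no response was received
    body: str               # decoded response body (empty on failure)
    elapsed_seconds: float  # wall-clock time taken by the request


def page_is_healthy(result: ProbeResult, expected_marker: str,
                    max_seconds: float = 10.0) -> bool:
    """Return True only if a valid page came back quickly enough.

    A bare "port is open" or "status is 200" check would pass for a server
    that is out of memory; requiring known text in the body catches that.
    """
    if result.status != 200:
        return False
    if result.elapsed_seconds > max_seconds:
        return False
    return expected_marker in result.body


def probe(url: str, timeout: float = 10.0) -> ProbeResult:
    """Fetch a page and record status, body and elapsed time."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status, body = err.code, ""
    except (urllib.error.URLError, OSError):
        # No usable response at all (DNS failure, refused, timeout).
        status, body = 0, ""
    return ProbeResult(status, body, time.monotonic() - start)
```

In use, a monitor would call `probe()` on a page of each hosted site and alert when `page_is_healthy()` returns False for a marker string known to appear on that page.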
This bug in the monitoring system has now been fixed to ensure it checks that a valid page is being generated. The issue itself has also been fixed (the system software at fault has been reconfigured, and a system has been put in place to automatically restart it should it use too much memory).

SSL Certificate Re-Issue
Tuesday June 6th
Status: Fixed
Services affected: SSL Certificates
Due to a bug in security software worldwide we have had to re-issue our SSL certificates. This means the current certificate has to be revoked before a new one is issued. There was a period of downtime for the SSL certificates which customers may have noticed. This has been resolved.

UK Server Downtime
Tuesday May 20th
Status: Fixed
Services affected: Access to UK hosted websites
There was an issue with our UK servers which resulted in downtime overnight. This issue has been resolved and the servers are now being closely monitored to ensure this does not happen again.

International Speed issues
Friday May 9
Status: Fixed
Services affected: Access to New Zealand hosted websites from offshore and some New Zealand ISPs
Several customers have noticed slow speeds when accessing websites hosted on our New Zealand servers from offshore. This problem also seems to affect connections from at least one New Zealand ISP (callplus/slingshot). We are investigating the issue with our upstream provider.
Update 16/5/2008
There appears to be evidence of a problem with the international bandwidth supplier being used by our upstream provider.

Thursday May 1 15:00 - 15:35
Status: Fixed
Services affected: Websites hosted on Brisbane server
Websites hosted on our Brisbane-based Australian server (note, unless you have specifically requested this, most websites are not hosted there) were unavailable due to a failure of the Optus uplink in the Australian datacentre.
Friday March 8 02:09 - 08:56
Status: Fixed
Services affected: Website outage for some websites
Due to a large load spike on one of our websites, one of our database servers ran out of disk space. This in turn led to a problem loading websites. As all relevant services were still running, however, our monitoring systems were unable to detect the fault. Once the problem was detected at approximately 7:30, steps were taken to add disk space to the affected server to get it operational. A full server restart was then required to get the affected websites live.

Tuesday December 4 10:00 - 10:20
Status: Fixed
Services affected: Website outage for some websites, reduced performance for up to an hour afterward. Page logging disabled for three hours.
Effect: Intermittent ability to access websites for some customers
Details: A routine database change was made to one of our servers early this morning. However, this had unexpected performance implications that slowed this server to the point where it began causing errors on websites and eventually crashed under the built-up load. This problem was fixed twenty minutes later (by restarting the affected server). However, flow-on performance effects resulted in reduced performance for several of our larger websites (especially for clients in the admin, and users while completing an order) for up to an hour afterwards. In order to reduce load to within acceptable limits we were forced to disable page logging during this time, meaning that clients will notice a three-hour "gap" in their page view statistics for this day.

Sunday 2007-11-25 - Wednesday 2007-11-28
Status: Completed
Services affected: Website hosting for some websites
Effect: Possible slowdown as your site is copied
Details: In order to prepare for the Christmas rush we are in the process of performing server upgrades in our hosting platform. This may involve us moving some websites to new servers, a process which should be transparent to the user.
However, while this is happening users may experience some slow-down, and in some cases a website may need to be switched to read-only mode to avoid order inconsistencies. This move should generally take less than three minutes per website.

Friday 2007-11-16 8:01 AM - 9:05 AM, again at 9:45 - 10AM
Status: Fixed
Services affected: Website hosting for some websites (other websites were unaffected)
Effect: Website intermittently slow/inaccessible
Details: Disk space was exhausted on one of the three "overflow" servers, which are used to temporarily provide extra hosting capacity to websites experiencing higher-than-average load. These servers have recently been added to help handle the Christmas rush and other unexpected load spikes that sometimes affect our clients. This resulted in page requests to that server hanging, causing 1 in 3 page requests to fail. This server was disabled while the problem was diagnosed, but was later re-enabled by another engineer, causing the problem to recur later that morning. To avoid this problem in future, the scheduled task that cleans up used disk space on these servers has been modified to run more frequently, and more sophisticated monitoring of disk usage has been implemented for these temporary servers.

Thursday 2007-9-20 9:30 - 10:24 AM
Status: Fixed
Services affected: Website hosting for 20% of websites (other websites were unaffected)
Effect: Intermittently, access slow or site inaccessible
Details: Unexpected load overwhelmed one of our five file servers, causing it to run extremely slowly. This resulted in the websites being slowed down (in some cases to the point where people were unable to access them at all). To resolve this we are moving some of our websites' file hosting off this server to increase its spare capacity. Within the next month we plan to replace this server with a much more recent and faster system as part of our ongoing upgrade processes.
Tuesday 2007-9-11 2:38 - 2:50 PM
Friday 2007-8-31 10:30 - 10:35 AM

Saturday/Sunday 2007-04-29 10:40PM-2:06AM
Status: Fixed
Services affected: Website Hosting: Yes; Email: No; Domains: No
A database server outage meant that many websites were inaccessible during this period.
Detailed description: Our main database server filled a disk with logs while experiencing much higher than average load. This outage also affected the SMS notifications from our monitoring system, meaning only email notifications worked. Combined with the fact that this outage occurred overnight on a Sunday, this resulted in the long outage window.
2007-04-26 9:06 - 13:01
Affected Services |