System outage

Comments

108 comments

  • Official comment
    Scott Riesebosch

    Unfortunately what started out as a server update to improve reliability has not gone as planned. We are working on resolving this as quickly as possible. Our team tested the updates on a staging server prior to implementing the updates, but unfortunately something went wrong when implementing the same solution on our production server and it has now taken down all users.

    So sorry about this everyone.

  • Ktwalsh52

    I use IT but have no tech education. What I do understand is that even with this outage I probably have lost more time due to Comcast outages in my area than Tailwind. I pay Comcast over $200 per month and they haven’t provided any compensation for the time they are down, so what do you expect for a one-time charge of $100? Scott will get this fixed and we will be back to having the best product on the market with no monthly fee.

    How do some of you clowns deal with real life issues if you think this is a crisis?

    7
  • Scott Riesebosch

    Hi Everyone,

    Thanks for the overwhelmingly positive support. We are back up and running and we have a much, much better understanding of things now. It appears the certificate issue is resolved. There is one process on the MQTT server that seems to flare up occasionally and we are keeping an eye on it, but that was not the cause of the service disruption.

    There are a few things still not online such as IFTTT, Alexa is having trouble discovering new devices, Smart Things isn't working, and a couple other things but we are working on all of them.

    I also want to give a big thank you to customers that have offered their kind words and one even joined in on the session with our team tonight (Thanks John and Gabe!!)

    Local control is being worked on, and so is a high availability server.

    Thank you everyone for your suggestions on where to look on the certificate issues. It just goes to show how blessed I am to have customers like you. I brought all your suggestions to the team.

    I have to get back to the team now. We are not done fixing things yet.

    6
  • Cain Brewer

    This is just getting silly. Like the 4th outage I've been impacted by lately, what changed? Flawless service up until a couple of months ago!

    5
  • Joseph Fiore

    I have no affiliation with the company... just throwing this out there. It's been down twice (maybe three times) since I've had it (April last year). Do I /want/ it to work perfectly every time? Of course. But, without monthly service fees, I expect a problem now and then. I'll certainly update that opinion if it gets more frequent.

    It stinks when it's rainy, dark, and you've gotten used to just driving up. But, by my calculations, it's worked about 99% of the time I've had it. Everyone has their tolerance for these things but this is within mine.

      --Joe
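
For readers who want to check a figure like the "about 99%" above, here is a minimal sketch of the arithmetic. The function names are made up for illustration and the one-year period is an assumption; nothing here comes from Tailwind's own monitoring.

```python
def allowed_downtime_hours(availability: float, period_days: float) -> float:
    """Hours of downtime that still fit inside a given availability target."""
    return (1.0 - availability) * period_days * 24.0


def availability_from_downtime(downtime_hours: float, period_days: float) -> float:
    """Fraction of the period the service was up, given total downtime."""
    return 1.0 - downtime_hours / (period_days * 24.0)


if __name__ == "__main__":
    # Assumed example: a 99% target measured over one year (365 days).
    hours = allowed_downtime_hours(0.99, 365)
    print(f"99% over a year allows about {hours:.1f} hours of downtime")
```

Under these assumptions, 99% over a year leaves room for roughly 87.6 hours (about 3.65 days) of downtime, so a few multi-hour outages can still fit inside that tolerance.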

    5
  • P

    This thread really shines at showing those who buy IoT products without any technical knowledge and complain without knowing what's behind it versus those with actual technical knowledge or an IT background.

    Count me as one of those with an IT background. Everything here points to Tailwind learning from their mistakes and improving on it. This is the first major outage in the two years I've had it as well. In comparison to other products that make you pay a monthly fee, lack the same features and such I'm still happy with this.

    I've been on the end of a software or infrastructure change where you deploy something and everything points to a high level of success, and then it still goes south even with backups and back-out plans. It's just part of IT. Comments like "The device should have everything it needs to continue working. Yeah " show a distinct lack of knowledge of how IoT devices function; a majority of other IoT products will fail when their back-end collapses or has issues.

    The other person, @dawgfaj, literally said "All the technobabble and excuses for TOTAL OUTAGE are unacceptable. Since there hasn’t been any offer of compensation for this constant level of inconvenience, " which might be the dumbest thing I've heard. That means no provider of any service can ever explain an outage to him.

    5
  • Scott Riesebosch

    Hi Everyone,

    I have another update. We were able to fix the certificate issue in the early hours this morning. We brought everything online as a test a couple hours ago and everything came back up, and we sent some test commands through. Everything worked. We took it back down for a bit to make some additional adjustments, but we have resolved the single largest issue, which was certificate related.

    The main point is - it came back up and we saw virtually everyone's controllers come back online this morning for a short time during our test. Once our team has completed some changes to settings and run some more tests, we will bring everything back up again and hopefully it will stay that way. We will continue to monitor it closely throughout the day, and I'm already preparing tomorrow's scrum meeting - with a single focus on DevOps / reliability. After this horrible experience I am laser focused on this. I'm sorry but feature requests will have to be delayed for a bit. Hopefully not too long. I've brought in some outside DevOps professionals to move things along quickly, and even some of our customers have offered to help. God bless IT people. What a wonderful bunch.

    To those of you in IT, I salute you. I am a hardware guy. IT / software is a tough job, and when things go wrong thousands of people want answers - and they deserve answers. This is why I've tried to keep people informed along the way the best I can.

    We learned some things on the last outage and we were in the process of implementing some of them when this one occurred. None of us like these fire drills, so please know that I have put, and will continue to put, more resources into reliability. Tailwind is not run on a server in my garage, contrary to what some folks have said to me. Our servers are on AWS, and we do have a backup server.

    Looking forward to putting this one to bed in the coming hours so we can get some sleep.

    We will push out announcements once things are up and running and appear to be back to normal. You may already be seeing your devices back online right now, but the team is still making some adjustments so please try to refrain from saying "Hey my light is green - I tried it and it still doesn't work". We will announce when it is stable again :)

    Thank you so much for your kindness and patience. It really does mean a lot to me, my family, and the team here behind the scenes.

    5
  • Derek

    Scott - I initially came here when the outage started to gripe (see my first comment). But after seeing your responses and learning a few things from others here, my perspective has changed. It’s obvious you’re dedicated to and passionate about your product. Which means I can be, too. My first instinct when I see a product or service have multiple issues is to jump to the conclusion that the company just doesn’t care about providing a good experience. That’s obviously not the case here. So I’m still very happy to be on #TeamTailwind. Excited to see how things improve and evolve based on the lessons from this experience.

    5
  • Egon Rinderer

    One hyphenated word and another word: micro-services architecture. There's a reason commercial cloud service providers (SaaS/PaaS/etc.) are M-SA based...you can roll out changes to containers slowly and only impact a small number of users and immediately roll back if necessary. Spread the services globally. Outages basically become a thing of the past (unless your cloud provider has a major outage). Anyway, retool the back end, throw it in Fargate (or whatever vendor's like-kind service you prefer) and forget about it. Life is SO much easier. I feel for the folks at Tailwind fighting these server based outages. All the focus goes into dragging legacy architecture along as opposed to modernizing. It's the plight of a fast growing small business trying to scale. Anyway, from a customer's perspective, while it sucks to have the service down, I've got traditional openers and some patience for the situation. The service is free, so I am not going to complain (yet). A local API would be really nice, though. 
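
The "roll out changes slowly and only impact a small number of users" idea in the comment above can be sketched as a percentage-based rollout gate. This is a generic illustration, not how Tailwind or any particular cloud service does it; `rollout_bucket` and `gets_new_version` are hypothetical names, and hashing the user ID is just one common way to get a stable assignment.

```python
import hashlib


def rollout_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def gets_new_version(user_id: str, percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    for percent in (1, 10, 50, 100):
        count = sum(gets_new_version(u, percent) for u in users)
        print(f"{percent:>3}% rollout reaches {count} of {len(users)} users")
```

Because each user always lands in the same bucket, raising the percentage only adds users, and setting it back to 0 acts as an immediate rollback for everyone.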

    4
  • Preethum Prithviraj

    The benefits still outweigh the problems given that there's no recurring fee. However, I agree with a few of the comments above: 1) Would really like an e-mail notification when this happens so that I don't spend time troubleshooting my own network first, 2) Would LOVE a locally accessible API to be available as a fallback (for our own coding access and for direct from the app into Tailwind device on the same network)

    4
  • Scott Riesebosch

    Hi Everyone and thanks for the feedback. Anyone who knows me knows I tell it like it is. I'm not a sales guy. I'm a hardware engineer.

    As you can see we are STILL down, but at least we have narrowed the problem down to a certificate related issue. We were performing an upgrade to the MQTT server portion of the system to improve reliability and set the stage for much higher capacity (after successfully testing it on a staging environment). When we went live on the production server, it suddenly refused to accept the certificate. When we immediately rolled back it also would no longer accept the certificate. We tried generating a new certificate and it won't accept that either. There doesn't seem to be anything wrong with the certificate itself because it works fine with other applications - it just doesn't want to work with the MQTT application.

    I've brought in multiple outside contractors to assist with it, and we are also in communication with the MQTT software company. The issue is quite clear from the logs. The MQTT broker is not accepting a valid certificate that works fine on other applications.

    I will post more details as they become available. If I go silent for a few hours it is because I am focused on getting this solved for you. Many of you have actually become more like friends than customers and I truly appreciate that.

    4
  • Derek

    This is the second time this happened since I purchased Tailwind in October. I’m already contemplating ditching it. Not sure what the problem is, but it doesn’t seem like you guys are addressing the root cause.

    3
  • Scott B

    @ Randy,
    If you have a high ceiling that requires a ladder you may want to consider buying a smart plug and plug tailwind into it so that you can power cycle it from your phone.
    Added benefit would be that you can power down tailwind to lockout access when going on vacation, etc...

    3
  • James Kirk

    I love my device and look forward to when it's working again. First issue I have had in 3 years. I am the only user and would love to see local control kept on a phone. Keep going Scott I am sure you can figure this out.

    3
  • Sam N

    More words of encouragement from another IT professional. I'm sure you will have lessons learned from this experience. Going forward I would suggest sticking with both push notification and email notifications advising that an upgrade is happening, not just a reactive outage message. For those IT professionals complaining on this forum, please try to empathize and remember what it's like to be on the other side of the table. This is a one time purchase piece of hardware with no recurring costs. One outage in 2 years for me is totally acceptable, especially with the response on this forum from Scott owning up to everything. Good luck Scott!

    3
  • Lvlucky

    Hey Scott and Team - It happens.  I've been nothing but happy with my Tailwind and the service that has come with it.  These things happen.  Things don't always go as planned.  In my world it won't kill me to use my manual garage door opener until the problem is resolved.  I appreciate your direct communication and sharing all the info with us.  As we have come to see over the last 9 months or so we are not a very patient country as of late.  Hang in there.  Thanks again,

    3
  • Scott Riesebosch

    Hi Everyone,

    Thanks. I do have to agree with the folks that say it should not be down this long. I think we can all agree on that point for sure. So here's where we are at as of now. We found the issue with the MQTT server and got it working with the certificate again. That also means that all the phone apps are now back online. The bad news is that none of the controllers are. For some reason the firmware is saying the certificate is not properly signed, when in fact it is - we checked multiple times.

    So the current thinking is that in the effort to get things working again with the MQTT server, something was changed that the controllers didn't like. So it seems like once we sort out why the firmware is still rejecting the certificate we'll be fully operational again.

    Now to comment on the fact that this is another outage that should not have happened. Based on what we learned from the last outage I took 2 steps.

    1) We switched to a larger AWS instance for both the production server and the backup server.

    2) I started interviewing DevOps professionals to add to our team.

    Sadly, this outage occurred before the new guys could really get started and put the updated architecture into place, and occurred despite the larger instances.

    The moment things are back online we will have an immediate 1 week sprint to get the highest impact tasks completed for reliability improvement, and I have a number of DevOps people advising on that. We will continue 1 week sprints until the DevOps people tell me we are in rock solid shape, and I will ask them to prove it to me.

    I know that many of our valued customers / friends are actually software developers. I have never hidden the fact that I am NOT a software developer - which really frustrates me. If anyone has links to High Availability / Reliability standards they would like to share with me I am more than happy to add them to the list of "Here is the standard we need to meet". I'm an engineer at heart and I live on standards / best practice documents.

    Unfortunately I have to get back to emails now. I am pretty much the single point of contact, and well, there are thousands of emails waiting in my inbox. Doing my best to reply to every one of them.

    Scott

    3
  • Scott Riesebosch

    Good morning everyone. I woke up this morning to panic messages again. We restored our security certificate only to discover this morning that it also has been invalidated. That is why the devices are not connecting. We are working on the certificate issue with our certificate provider. We currently do not know why they keep invalidating it, but that is the issue we keep facing. Our server is running fine - no issues there.

    More to come.

    3
  • Ian Holden

    I've had and been using Tailwind for a couple of years now and never had any issues with outages for extended periods of time. When I have had a problem (normally user error!) Scott has emailed me back promptly with a detailed explanation to fix the issue I was having. Yes, it is a little frustrating the system has been down due to the server upgrade, but REALLY, people. It is a non-subscription service, which nowadays is a rarity when a lot of other smart home products are going that way (IFTTT, Ring, Blink, Wyze, etc.). I have complete confidence in Scott and the team to get this fixed as quickly as they can. I've had iOS notifications saying it is down or whatever, and Scott has said he agrees and will send out emails in the future. The system in my eyes is still a lot better than anything else out there, and as soon as I buy myself a 2nd garage opener I'll be purchasing a 2nd Tailwind 100%.

    Keep up the great work and support Scott.  Cheers

    3
  • Chris Griesemer

    System is offline in Texas as well. In the time Tailwind has gone down three times, I haven't had any other outages with my smart home products. Getting a little tired of the unreliability, especially when I have it to make my wife/family's lives easier/more convenient. This only adds more consternation and frustration.

    2
  • Scott Riesebosch

    Hi Everyone, 

    Thanks for your patience. We actually do have our server in the cloud. It's on AWS. I selected AWS because I wanted the highest reliability server. Ironically we were doing an upgrade to the server to increase the reliability when something went very wrong. I am personally quite upset because I put additional resources in place to keep this from happening. There are changes being made to our team. This should never happen to this degree.

    Also, it appears that a number of people are not receiving the notifications that are being sent through the app. We've sent 3 of them. We wanted to use the notifications because they're immediate, but I hear you - we need to send emails as well.

    We are testing local APIs internally already. In the interest of turning something terribly negative into a positive, may I please ask how you would like the local API to work when it comes to shared access? The owner's phone / account would always work, but most people share access, either full access or time-of-day restricted access.

    The one thing we struggle with is a situation like the following:

    You are the owner. You share access with someone else. Then you revoke that access but that user has taken their phone offline from the internet, and your Tailwind device is offline from the internet. This opens up a security risk because that shared user would still have access to your Tailwind controller and could open the door until the Tailwind controller is reconnected to the internet and receives the updated permission status for the shared user.

    I'm going to guess that this is an acceptable risk to most users but would like to ask.
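
The revocation scenario described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Tailwind's firmware or API: `LocalController`, `sync`, `may_open`, and the `max_offline_seconds` grace period are all hypothetical. The sketch shows the risk (a cached permission outlives a revocation while the controller is offline) and one possible way to bound it (stop honoring shared access once the cache is too old).

```python
from dataclasses import dataclass, field


@dataclass
class LocalController:
    owner: str
    max_offline_seconds: float          # assumed grace period for cached shares
    last_sync: float = 0.0              # time of the last successful cloud sync
    shared_users: set = field(default_factory=set)

    def sync(self, shared_users, now: float) -> None:
        """Replace the cached share list with the cloud's current copy."""
        self.shared_users = set(shared_users)
        self.last_sync = now

    def may_open(self, user: str, now: float) -> bool:
        """Decide locally whether this user may operate the door."""
        if user == self.owner:
            return True                 # the owner always works locally
        cache_is_fresh = (now - self.last_sync) <= self.max_offline_seconds
        return cache_is_fresh and user in self.shared_users


if __name__ == "__main__":
    controller = LocalController(owner="owner", max_offline_seconds=3600)
    controller.sync({"guest"}, now=0)
    # The owner revokes "guest" in the cloud, but the controller is offline
    # and never hears about it, so the cached share is still honored.
    print(controller.may_open("guest", now=600))
    # Once the cache is older than the grace period, shared access stops.
    print(controller.may_open("guest", now=7200))
```

The trade-off is that a shorter grace period narrows the revocation window but also locks out legitimate shared users sooner during a long internet outage.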

    2
  • Egon Rinderer

    I hear you, Scott. My point was VMs in the cloud still go down (even with redundancy). Moving to a serverless MSA approach brings reliability levels that simply can't be achieved by a "server in the cloud" approach (especially when it comes to rolling out upgrades, etc.). Anyway, not relevant for this thread. Appreciate your hard work to get things back to stable working with your current architecture.

    As for local API: that level of "how it works" would (in my mind) be left to the whim of the owner. Just expose (with proper auth) the API and let folks create their own tooling around it. If I want to use my RasPi that I use for a dozen other things for integration, so be it. I wouldn't be looking to you to create the "how it works". Just the interface. Others' views may well vary.

    2
  • Dave Sullivan

    Is it possible to flip the switch and say if the device is connected to my wifi I could be on the same wifi network and access it? If you're not on my wifi then it would not work.
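
One simple way to express the "only if you are on my wifi" rule is a subnet check. This is a minimal sketch, assuming the controller knows its own address and a /24 home network; `on_same_network` is a hypothetical helper, and a real device would still need proper authentication on top of it.

```python
import ipaddress


def on_same_network(client_ip: str, device_ip: str, prefix_len: int = 24) -> bool:
    """True if the client address falls in the device's local subnet."""
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{device_ip}/{prefix_len}", strict=False)
    return ipaddress.ip_address(client_ip) in network


if __name__ == "__main__":
    device = "192.168.1.20"             # assumed controller address
    for client in ("192.168.1.50", "10.0.0.5"):
        print(client, on_same_network(client, device))
```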

    2
  • Scott Riesebosch

    Agreed Dave. Just sent another one now.

    2
  • John Stiles

    Fingers crossed for it not to be a time/date issue *cringe*
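
On the time/date worry: a certificate is only accepted inside its validity window, so a wrong clock on either side can make a correctly signed certificate look invalid. Nothing in this thread confirms that this was the cause here; the sketch below only shows the check itself, using the date format Python's `ssl` module reports for `notBefore` / `notAfter`. `validity_status` and the example dates are made up for illustration.

```python
import ssl
from datetime import datetime, timezone


def validity_status(not_before: str, not_after: str, now: float) -> str:
    """Classify a clock reading against a certificate's validity window."""
    start = ssl.cert_time_to_seconds(not_before)
    end = ssl.cert_time_to_seconds(not_after)
    if now < start:
        return "not yet valid"
    if now > end:
        return "expired"
    return "valid"


if __name__ == "__main__":
    # Assumed example window; real values would come from the certificate.
    not_before = "Jan  1 00:00:00 2024 GMT"
    not_after = "Jan  1 00:00:00 2025 GMT"
    # A device whose clock was reset to 2000 sees the certificate as not yet valid.
    reset_clock = datetime(2000, 1, 1, tzinfo=timezone.utc).timestamp()
    print(validity_status(not_before, not_after, reset_clock))
```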

    Scott, put my vote in the "all in" pile for local control. The revocation scenario seems very niche if both the device and revokee's app would need to be offline to deny access. Even if the device needs to be online to commit permission changes, it's still more functionality than it has now. Then again, I also consider it acceptable risk that someone could plausibly just walk to a nearby window and shout "hey google, open garage door one....1234".

    2
  • Tony Adams

    Kevin, I think you are correct. I also set up mine yesterday and have the yellow light and the app telling me the garage door is offline. I hope this issue is resolved shortly as I want to be able to play with this device 😂

    Maybe this is a good case for having a local offline api? (I'm a hubitat user and saw that this local api is on the roadmap, which is one of the reasons I purchased this over others)

    2
  • Mark Straub

    Love Tailwind. You have my support.

    2
  • Joshua Haigh

    Down in Yorkshire, UK, this does seem to be getting regular.

    1
  • Ken

    Brisbane Australia = down as well

    1
  • Michael Teator

    I'd ask if anyone wants to buy my two tailwinds but who wants something that always goes down.

    1
