System outage

108 comments

  • Tim Osborne

    Scott,

    Since you are using AWS, have your team start using their Blue/Green deployment method.

    Deploy new code to the blue environment, move your internal teams over (who I'm sure use your product), and have them use it for hours/days before you deploy it to everyone else (the green environment).

    This way you will know if there is an issue before rolling it out to everyone else. This allows you to fix and retest any/all issues before deploying to the green environment.
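
    For illustration, here's roughly what that cutover looks like on AWS with weighted Route 53 records. This is only a sketch, not Tailwind's actual setup: the zone ID and record names are hypothetical, and it assumes boto3 with working AWS credentials.

    ```python
    import boto3

    route53 = boto3.client("route53")
    ZONE_ID = "Z0EXAMPLE"  # hypothetical hosted zone ID

    def shift_traffic(blue_weight: int, green_weight: int) -> None:
        """Move traffic between the blue and green environments via weighted DNS."""
        changes = [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",   # hypothetical record name
                    "Type": "CNAME",
                    "SetIdentifier": env,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": f"{env}.example.com"}],
                },
            }
            for env, weight in (("blue", blue_weight), ("green", green_weight))
        ]
        route53.change_resource_record_sets(
            HostedZoneId=ZONE_ID, ChangeBatch={"Changes": changes}
        )

    shift_traffic(100, 0)  # internal teams soak on the new code first
    # ...hours/days later, once it's proven out:
    shift_traffic(0, 100)  # everyone else moves over
    ```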

  • Scott B

    @Randy,
    If you have a high ceiling that requires a ladder, you may want to consider buying a smart plug and plugging Tailwind into it so that you can power cycle it from your phone.
    An added benefit is that you can power down Tailwind to lock out access when going on vacation, etc.
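
    If you'd rather script the power cycle than tap through the plug's app, here's a minimal sketch assuming a TP-Link Kasa plug and the python-kasa library (the plug's address is made up; adapt for your plug's brand):

    ```python
    import asyncio

    from kasa import SmartPlug  # python-kasa library

    PLUG_IP = "192.168.1.50"  # hypothetical LAN address of the smart plug

    async def power_cycle() -> None:
        """Cut power to the Tailwind controller, wait, then restore it."""
        plug = SmartPlug(PLUG_IP)
        await plug.update()      # fetch current state before issuing commands
        await plug.turn_off()
        await asyncio.sleep(10)  # let the controller fully power down
        await plug.turn_on()

    asyncio.run(power_cycle())
    ```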

  • Francesco

    Great idea, @Scott B! Mine is just over my head (Liftmaster 8500 jackshaft opener), but I’d still rather not have to venture to the garage and blindly fiddle with a micro-USB plug. On it. 😄

  • Preethum Prithviraj

    I agree with a tiered structure for local access: warn the owner about the potential security scenario, then let them accept the risk and grant local access on a user-by-user basis. Long term, if the app can send direct commands to Tailwind, perhaps the interface could allow the owner-credentialed app to override the locally stored permissions for that scenario. I realize that's far more complicated with synchronization and would require testing, but pipe-dreams and all...

    I'm also in favor of having authorized access to the API directly to write my own routines, but for the vast majority of scenarios and users, I think app-to-device on the local network would provide a sufficient backup for a system outage (or even if local broadband went down but internal networks were still live). For instance, I have a generator/UPS, so when a storm knocks out power I'm still good, and while broadband is rarely affected, when it has happened all my internal systems continued to work (think IP cams, security, NAS, Plex, etc.).
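
    To make the pipe-dream concrete, here's a purely hypothetical sketch (every name is invented; nothing from Tailwind's actual app) of a locally cached permission table with an owner override:

    ```python
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LocalGrant:
        """One user's locally cached permission entry on the device."""
        user_id: str
        local_access: bool                     # user accepted the risk warning
        owner_override: Optional[bool] = None  # last word synced from owner app

    def may_open_locally(grant: LocalGrant) -> bool:
        # An owner-pushed override beats whatever is cached for the user.
        if grant.owner_override is not None:
            return grant.owner_override
        return grant.local_access

    grants = [LocalGrant("alice", True),
              LocalGrant("guest", True, owner_override=False)]
    print([(g.user_id, may_open_locally(g)) for g in grants])
    # -> [('alice', True), ('guest', False)]
    ```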

    On a secondary note, you indicated you sent out 3 notifications. I checked my app and didn't see evidence of that; it's possible I just accidentally swiped a notification away. But for those types of notifications, are they logged in the history like the open/close events are? And on that note, since the manual notification system was still active, could you add a flag within the notification system for the future that indicates a system outage (time of outage, last update time from your team) and displays it in the app?

    Thanks to you and your team for the effort and improvement goals. I realize some of the new users have seen it more frequently, but overall I've been far more impressed with Tailwind than friends who have other systems. So I still think this one's the right way to go.

  • Dave Sullivan

    Is it possible to flip a switch so that if I'm on the same wifi network the device is connected to, I can access it? If you're not on my wifi, then it would not work.
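
    On the device side, that check is tiny. A hypothetical sketch (made-up subnet) of honoring commands only from the same LAN:

    ```python
    import ipaddress

    LOCAL_NET = ipaddress.ip_network("192.168.1.0/24")  # hypothetical home subnet

    def allow_command(client_ip: str) -> bool:
        """Permit door commands only when the request comes from the LAN."""
        return ipaddress.ip_address(client_ip) in LOCAL_NET

    print(allow_command("192.168.1.42"))  # True: phone on the same wifi
    print(allow_command("203.0.113.7"))   # False: request from the internet
    ```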

  • Mike

    This certainly is a disappointment. As an IT professional, I’m questioning your team's ability to understand basic concepts of resilience & redundancy, especially using public cloud hyperscalers & services. As for recovery, backup much? Snapshot much?
    It should not be that hard.

  • Joseph Fiore

    Dave Sullivan said: "if the device is connected to my wifi I could be on the same wifi network and access it"

    This is what I thought we were talking about as a first pass. If you revoke someone's access, presumably you'd change your WiFi password. It's not ideal (you can't access it from around the world), but it's a good emergency backup. The other bother is that, driving up, you'd have to wait for your phone to connect to your wifi.

  • Peter Bizior

    Any update on when this can be resolved? I thought it was my device/WiFi, but after double-checking everything I couldn’t find the problem. I would appreciate it if you could give us an ETA if you have one.

    Thanks.

  • Scott Riesebosch

    Hi Everyone and thanks for the feedback. Anyone who knows me knows I tell it like it is. I'm not a sales guy. I'm a hardware engineer.

    As you can see we are STILL down, but at least we have narrowed the problem down to a certificate-related issue. We were performing an upgrade to the MQTT server portion of the system to improve reliability and set the stage for much higher capacity (after successfully testing it on a staging environment). When we went live on the production server, it suddenly refused to accept the certificate. When we immediately rolled back, it would no longer accept the certificate either. We tried generating a new certificate and it won't accept that either. There doesn't seem to be anything wrong with the certificate itself because it works fine with other applications - it just doesn't want to work with the MQTT application.

    I've brought in multiple outside contractors to assist with it, and we are also in communication with the MQTT software company. The issue is quite clear from the logs. The MQTT broker is not accepting a valid certificate that works fine on other applications.
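
    For the technically inclined who have asked how you even debug something like this: a probe along these lines (illustrative only; the hostname is hypothetical, not our real endpoint) surfaces the exact TLS failure reason when connecting to a broker's TLS port.

    ```python
    import socket
    import ssl

    HOST, PORT = "mqtt.example.com", 8883  # hypothetical MQTT TLS endpoint

    ctx = ssl.create_default_context()
    try:
        with socket.create_connection((HOST, PORT), timeout=10) as sock:
            with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
                print("Handshake OK:", tls.version())
                print("Cert expires:", tls.getpeercert().get("notAfter"))
    except ssl.SSLError as exc:
        # The reason string usually pinpoints the failure, e.g. an
        # incomplete chain, hostname mismatch, or unsupported signature.
        print("TLS handshake failed:", exc)
    ```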

    I will post more details as they become available. If I go silent for a few hours it is because I am focused on getting this solved for you. Many of you have actually become more like friends than customers and I truly appreciate that.

  • Dave Sullivan

    During times like this you cannot over-communicate. I highly recommend putting out a notification on the app and letting people know again.

  • Dave Sullivan

    Also, thanks for the update. I've been a customer for quite a while and I really appreciate everything that you do here.

  • Scott Riesebosch

    Agreed Dave. Just sent another one now.

  • Dave Sullivan

    Got it!

  • Peter Bizior

    Thanks, Scott, for providing details on it; much appreciated. I feel that in situations such as this, transparency and communication are key. Good luck with overcoming this.

    Thanks.

  • John Stiles

    Fingers crossed for it not to be a time/date issue *cringe*

    Scott, put my vote in the "all in" pile for local control. The revocation scenario seems very niche if both the device and the revokee's app would need to be offline to deny access. Even if the device needs to be online to commit permission changes, it's still more functionality than it has now. Then again, I also consider it an acceptable risk that someone could plausibly just walk to a nearby window and shout "hey google, open garage door one....1234".

  • James Kirk

    I love my device and look forward to when it's working again. First issue I have had in 3 years. I am the only user and would love to see local control kept on a phone. Keep going Scott I am sure you can figure this out.

  • Sam N

    More words of encouragement from another IT professional. I'm sure you will have lessons learned from this experience. Going forward, I would suggest sending both push and email notifications advising that an upgrade is happening, not just a reactive outage message. For those IT professionals complaining on this forum, please try to empathize and remember what it's like to be on the other side of the table. This is a one-time-purchase piece of hardware with no recurring costs. One outage in 2 years for me is totally acceptable, especially with the response on this forum from Scott owning up to everything. Good luck Scott!

  • David

    @Sam N: Not sure what “plan” you’re on, but this is the 3rd 24+ hour outage this half of the year. So I'm not sure how your system was only impacted once over the past 2 years.

  • Sam N

    I would say that due to COVID we have definitely been driving less this year altogether, so I can't speak for everyone regarding the number of outages I've personally noticed. If local control as a failover comes out of this outage, everyone wins.

  • Lvlucky

    Hey Scott and Team - it happens. I've been nothing but happy with my Tailwind and the service that has come with it. Things don't always go as planned. In my world it won't kill me to use my manual garage door opener until the problem is resolved. I appreciate your direct communication and sharing all the info with us. As we have come to see over the last 9 months or so, we are not a very patient country as of late. Hang in there. Thanks again!

  • Earl Crane

    I’m sorry friends, I broke it.

    True story, I let my 2 year-old have my phone a couple days ago and a Tailwind notification opened up the app. I’m usually really good about putting it in Guided Access, but it was a little crazy so I didn’t get to this time.

    When I got the phone back from her, the app was open on the config screen. I closed the app and figured I would check it later, but then I noticed the garage doors were not responding as usual (schedule, arrival, etc). 

    I spent a day uninstalling, reinstalling, rebooting - figured it was just me. Then I came here and saw that my toddler must have Godzilla’d the world. Sorry, y’all.

    Austin, TX

  • Scott Riesebosch

    Hi Everyone,

    Thanks. I do have to agree with the folks that say it should not be down this long. I think we can all agree on that point for sure. So here's where we are at as of now. We found the issue with the MQTT server and got it working with the certificate again. That also means that all the phone apps are now back online. The bad news is that none of the controllers are. For some reason the firmware is saying the certificate is not properly signed, when in fact it is - we checked multiple times.

    The current thinking is that, in the effort to get things working again with the MQTT server, something was changed that the controllers didn't like. So it seems that once we sort out why the firmware is still rejecting the certificate, we'll be fully operational again.
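
    For those following along at home, one classic culprit worth ruling out (an illustrative sketch only - the file names are hypothetical and it assumes RSA certificates): desktop tools often fetch a missing intermediate quietly, while embedded TLS stacks reject a leaf whose chain doesn't link up. The Python cryptography library can check the linkage directly.

    ```python
    from cryptography import x509
    from cryptography.hazmat.primitives.asymmetric import padding

    # Hypothetical files: the leaf the broker serves and the intermediate
    # CA certificate the firmware expects to chain through.
    leaf = x509.load_pem_x509_certificate(open("leaf.pem", "rb").read())
    ca = x509.load_pem_x509_certificate(open("intermediate.pem", "rb").read())

    print("Leaf issuer:", leaf.issuer.rfc4514_string())
    print("CA subject :", ca.subject.rfc4514_string())

    # Raises InvalidSignature if the leaf was not signed by this CA's key.
    ca.public_key().verify(
        leaf.signature,
        leaf.tbs_certificate_bytes,
        padding.PKCS1v15(),
        leaf.signature_hash_algorithm,
    )
    print("Signature verifies: the chain links up.")
    ```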

    Now to comment on the fact that this is another outage that should not have happened. Based on what we learned from the last outage, I took 2 steps:

    1) We switched to a larger AWS instance for both the production server and the backup server.

    2) I started interviewing DevOps professionals to add to our team.

    Sadly, this outage occurred before the new guys could really get started and put the updated architecture into place, and occurred despite the larger instances.

    The moment things are back online we will have an immediate 1 week sprint to get the highest impact tasks completed for reliability improvement, and I have a number of DevOps people advising on that. We will continue 1 week sprints until the DevOps people tell me we are in rock solid shape, and I will ask them to prove it to me.

    I know that many of our valued customers / friends are actually software developers. I have never hidden the fact that I am NOT a software developer - which really frustrates me. If anyone has links to High Availability / Reliability standards they would like to share with me I am more than happy to add them to the list of "Here is the standard we need to meet". I'm an engineer at heart and I live on standards / best practice documents.

    Unfortunately I have to get back to emails now. I am pretty much the single point of contact, and, well, there are thousands of emails waiting in my inbox. I'm doing my best to reply to every one of them.

    Scott

  • Jacob DePriest

    Thanks for the details! I’ve been in your shoes. Keep up the good work and the positive attitude. Big fan of this product and service. Glad to hear the DevOps plan - sounds like you're on the right track.

  • Daniel

    Hey Scott, thanks for the update. I was one of the guys sending emails after believing there was nothing I could do myself; a message in the app could be really useful.

    After a year using this product on several different setups, I still need to congratulate you and let you know you have the support of all the users and the community.

    Keep strong and push ahead.

    Cheers from Costa Rica

  • Sam Dunham

    Disclaimer: I am also not a software developer.

    That said, one doesn't need to be a software developer to have addressed this issue properly, and Tailwind has not addressed this issue properly. What I AM is a system administrator. As a system administrator, I know the value of having a backup of any functioning system before making a change that could potentially cause an outage. At my job, before we roll out something as simple as a Windows update on a server, we make sure we have a complete backup of the virtual machine. If something breaks after the update, we can have the system back up and running within twenty minutes. There is absolutely zero excuse for a production system being down this long.

    You want best practice documents? I'll make it simple. One, create a backup of the production VMs. Two, make the change. Three, if the change doesn't cause any issues, you're golden. Four, if the change brings the entire system down for your customers, kill the VM with the changes and spool up the backup VM as the live VM. Done. Go back to the lab environment and try to discern what caused the problem.
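
    On AWS, steps one and four are only a few lines. A sketch with boto3 (the instance ID is made up):

    ```python
    import boto3

    ec2 = boto3.client("ec2")
    INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical production broker VM

    # Step one: image the production VM before touching anything.
    image = ec2.create_image(
        InstanceId=INSTANCE_ID,
        Name="mqtt-broker-pre-upgrade",
        NoReboot=True,  # no downtime, at a small filesystem-consistency risk
    )
    print("Rollback AMI:", image["ImageId"])

    # Step four, if the change goes bad: launch a replacement from the AMI.
    # ec2.run_instances(ImageId=image["ImageId"], InstanceType="t3.medium",
    #                   MinCount=1, MaxCount=1)
    ```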

    You said all of this was on AWS, so every bit of that should be reasonably trivial. This isn't a software development issue, it's a fundamental system admin issue. Personally, I LOVE Tailwind. I've recommended Tailwind to neighbors. But y'all need to get this process fixed. I completely understand outages. They happen. This one, specifically, was self-inflicted (and it's still down - again, no excuse).

  • Sam N

    As mentioned above, it's a certificate issue. I've had backups upon backups of both physical and virtual environments, but if a certificate fails, for whatever reason, it's not always black and white to recover from. I used to be a systems admin; now they report to me. Scott is already owning up to it and has documented steps to rectify things going forward. I think only good can come out of this unfortunate event based on his remediation steps... But in a perfect world it shouldn't have happened in the first place, and yes, this is quite a long outage for those who absolutely need the internet-connected functionality. I for one can get by with my clicker / HomeLink until things are resolved.

  • John Banner

    The device should have everything it needs to continue working. Yeah, cool, thanks for pushing through updates, but now that something's gone wrong there should be a standalone mode that keeps working so long as the wifi/bluetooth/power, etc. is operational. It should never be the case that something from outside my home network breaks the unit. And if you are going to push updates to my device, I'll need to be the one who gives you a green window, so I'm not out of the house and coming home to find myself trapped in my driveway. And where's your rollback plan?

  • dawgfaj

    Wouldn’t recommend this product/SERVICE to my worst enemy!!!  All the technobabble and excuses for TOTAL OUTAGE are unacceptable.  Since there hasn’t been any offer of compensation for this constant level of inconvenience, the only option I have is to hit all sources of social media and product marketing to warn unsuspecting customers of poor reliability.  WOW, what ever happened to integrity!!!!

  • Scott B

    Wow guys.
    1st world problems I guess....

  • P

    This thread really shines a light on those who buy IoT products without any technical knowledge and complain without knowing what's behind them, versus those with actual technical knowledge or an IT background.

    Count me as one of those with an IT background. Everything here points to Tailwind learning from their mistakes and improving on them. This is the first major outage in the two years I've had it as well. In comparison to other products that make you pay a monthly fee, lack the same features, and so on, I'm still happy with this.

    I've been on the end of a software or infrastructure change where you deploy something and everything points to a high level of success, and then it still goes south even with backups and back-out plans; it's just part of IT. Comments like "The device should have everything it needs to continue working" show a distinct lack of knowledge of how IoT devices function; a majority of other IoT products will fail when their back-end collapses or has issues.

    The other person, @dawgfaj, literally said "All the technobabble and excuses for TOTAL OUTAGE are unacceptable. Since there hasn’t been any offer of compensation for this constant level of inconvenience," which might be the dumbest thing I've read. It means no provider of any service could ever explain an outage to him.

