
SaaS Orchestration with AWS Step Functions

We'll have four lightning talks covering a wide array of uses for Step Functions within a SaaS application.

SaaS Fundamentals for Orchestration

Speaker: Bill Tarr

SaaS is more than just an application. It includes concepts and practices like Control Planes, Frictionless Onboarding, Configuration Management, and DevOps practices. In this section, we’ll identify these building blocks, and consider what orchestration means and where it may fit into our SaaS solutions.

Using Step Functions to Achieve Lambda-less AppSync for SaaS

Speaker: Jason Wadsworth

As a builder of SaaS software, do you find yourself looking at AppSync direct integrations with a bit of jealousy? I’ll talk about how you can use Step Functions directly integrated with AppSync to access DynamoDB without the need for a Lambda function, all while maintaining tenant data isolation.

Step Functions to Coordinate Tasks

Speaker: Andres Moreno

When you offer software as a service, you'll need a strategy to import your clients' data so they can continue operations. Andres will go over an approach to migrating data using Step Functions to coordinate different tasks, allowing you to scale to reduce processing times while also getting the visibility necessary to track the workflow.

User Onboarding with Step Functions

Speaker: Seth Geoghegan

User onboarding is a core concern of any SaaS solution. In this section, we’ll discuss how Nerdy leverages AWS Step Functions to deliver a robust and fully automated onboarding experience.

From software builder to designer, Bill has 20+ years of experience shaping best-in-class SaaS technology strategies for organizations from startup to enterprise. He's also an AWS SaaS community leader and producer of the Building SaaS on AWS show on twitch.com/aws, a frequent public speaker with experience at top-tier AWS events such as re:Invent, and a publisher of SaaS best practices on the AWS partner blog.

Jason is a career problem solver, serverless lover, AWS Community Builder, and SaaS enthusiast. He loves to talk about SaaS and how to build solutions on AWS. He is also a collector of LEGO, with sets ranging from his nerdy side, like his many Star Wars sets, to his love of architecture, like his current build of the Colosseum.

Andres is a principal software engineer at Tyler Technologies with over a decade of experience. He’s a skilled AWS serverless engineer and has spent time as an AWS Community Builder. Andres writes in-depth technical content on his blog where he shares his insights on serverless, CI/CD and cloud security.

Seth Geoghegan is a software developer with a passion for learning and sharing what he knows with the community. Seth has over 20 years of experience in software development, operations, architecture, and everything in between. In the past several years, he has been focused on serverless application architectures and implementations. Seth is an avid runner, cyclist, photographer, and gardener, and is working towards perfecting his homemade pizza recipe.

Transcript

Bill Tarr 6:34
Awesome, Brian, thank you. So yeah, my name is Bill Tarr, I'm a SaaS evangelist with AWS SaaS Factory, and I have the honor of setting up what's going to be several lightning talks here about Step Functions. So we're really focused on how SaaS and Step Functions fit together. I'm going to set the table with a couple of different questions about how we think about Step Functions and what they mean to SaaS. First, if you're not familiar with orchestration, let me at least pitch this topic out there, because both orchestration and choreography will be important if you're building SaaS. How are they different? Well, these pictures kind of tell the story, right? Orchestration is really like a flow. If you've ever seen a workflow, if you've ever built a workflow in draw.io, they look a lot like this, and that's exactly how we think about orchestration: there's a beginning and there's an end. And you see the word onboarding on here. That's there for a reason, and I'm going to talk about onboarding a little bit in a while. Comparing orchestration with choreographed messages: choreographed messages are often a way that we implement onboarding as well. Choreographed simply means we're throwing events out there, and they're happening independently of the thing that sent them. It's an important point that these both exist. You can think of orchestration as Step Functions in AWS terms, and choreography as EventBridge: two different services, both valuable to SaaS. But today, we're going to focus on orchestration, and I like the term workload orchestration. I sometimes hear people talk about microservices almost immediately when they start talking about Step Functions, and start talking about decomposing the workloads. But really, what we're talking about with orchestration is workload orchestration. These are some examples of how we want to orchestrate the workload and what we want to achieve. So, working backwards from the task we want to successfully do: I want to sequence tasks, I want to retry things that fail, I want to run things in parallel. These are the types of requirements that Step Functions can help us answer. And in SaaS, we're going to have a variety of different workloads like this, both in the control plane and in the application plane, and I'm going to define those for you in just a minute as well. Why Step Functions? Well, we like Step Functions for a lot of reasons. First, they do orchestration; we said why we like orchestration, composing these workloads and orchestrating them. Step Functions lets us build state machines, and each of those states is a step. You can think of them as a Lambda, you can think of them as a container call, whatever you like; each of those steps represents some unit of work. And this lets us break down those units of work into smaller units. You move between the states through transitions, and you're able to easily reuse components: you're able to take Step Functions and orchestrate the same Lambdas, for example, and use them in different ways for different types of workflows. This allows us to have a very extensible workload and be more agile, which is one of the main tenets of what we want to achieve in SaaS. And it lets us do things like visual workflows. I think if some people were familiar with Step Functions back in the day, they may have a bit of a bad feeling, remembering, oh, it was kind of hard to use this Amazon States Language, ASL, and I had to write a lot of CloudFormation.
If you haven't looked at it in a while, go back and take a look today. It's been implemented in the CDK. I have a short example down here of a simple CDK function that actually chains together a start and a couple of actions. I wrote this for a customer example I was working on where we're creating a batch job: we send that batch job off to one compute, we then send it off to another, more expensive GPU-based compute, then we send it off to another Lambda for some post-processing. And you can see that's four simple lines of code in our infrastructure as code. One of the things I like about Step Functions is the cool way it can visualize some of these workflows. And again, if you haven't looked at Step Functions in a while, you really need to take a look at the interface and see how this has really evolved. This is a fairly old picture of what these workflows used to look like; if you look at the UI today, it's a much richer and more elegant experience for how it handles these visualizations as well. And you can execute and monitor things nicely too.
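Bill's slide isn't reproduced here, but a minimal sketch of the kind of CDK chaining he describes might look like the following, assuming it lives inside a CDK stack and that the three Lambda functions (submitFn, gpuFn, postProcessFn) are hypothetical and defined elsewhere:

```ts
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// Each step wraps one unit of work as a Lambda invocation.
const submitJob = new tasks.LambdaInvoke(this, 'SubmitBatchJob', { lambdaFunction: submitFn });
const gpuJob = new tasks.LambdaInvoke(this, 'RunGpuCompute', { lambdaFunction: gpuFn });
const postProcess = new tasks.LambdaInvoke(this, 'PostProcess', { lambdaFunction: postProcessFn });

// Chain the states together: start -> submit -> GPU compute -> post-process.
new sfn.StateMachine(this, 'BatchPipeline', {
  definitionBody: sfn.DefinitionBody.fromChainable(
    submitJob.next(gpuJob).next(postProcess),
  ),
});
```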

Bill Tarr 10:50
One of the advantages of Step Functions for everybody, not just SaaS builders, but especially important for SaaS builders, is that they're very easy to track the state of and very easy to troubleshoot. It's much easier to take a step function and walk through it in the debugger than it would be to use EventBridge and fire events over the wall. It means you're still going to have to track these things; you're still going to want to perhaps turn X-Ray on and monitor the state of these things in production. But when things happen that are unexpected, you can easily trace them back to this workflow and figure out exactly what went wrong, where, and hopefully why as well. Continuing on, thinking about why we like Step Functions: serverless is just a great fit for SaaS, and Step Functions is serverless. It allows us to match our tenants' consumption to the cost, the scale, and the performance of what we're trying to produce for our tenants. So instead of us spinning up, say, a bunch of servers to manage our workflow, we can simply say, as our customers come in, as our tenants (and again, sometimes we use those terms interchangeably in SaaS), as our tenants come in and start to consume, Step Functions will scale up for us and handle that scale. It'll keep the performance good as it scales up, and it keeps our costs down, especially in the earlier phases of our project. And when you think about how SaaS can really grow (it's a growth model, right? That's why we're building SaaS), then you can really start to see why serverless is such a great fit. Because if you can't align to that consumption, if you're simply spiking all of your costs right out of the gate, there's a pretty good chance it's not going to be profitable for a long time. So this helps you right-size your workloads for the customers you have in hand today.

Bill Tarr 12:38
So where do Step Functions fit into SaaS? Well, let's break down a few SaaS concepts. You already heard me leak this one before, because I can't help saying control plane when I'm talking about SaaS. SaaS isn't just an application; SaaS is a multi-plane solution. Those planes include a control plane and an application plane, and sometimes even other planes, like a management plane. A simple definition of a control plane is the place where those of us who are operating the SaaS solution, whether those are administrators or technologists, go to manage the SaaS experience. It is really, if you want to think about it this way, a single-tenant application, right? Or, in SaaS terms, a siloed application, where, in most solutions, no tenants are logging directly into our control plane. Here, we manage the onboarding experience, the tenant experience, identity, billing, metrics, those types of things. But of course, every solution has an application, and that SaaS application may be made up of multiple applications itself, right, those services that make up our application. That's where our tenants interact with our software. So our control plane is there to manage one or many of these application planes, and this is a simple example of what that might look like. And, you know, I mentioned the onboarding story as something that we think might be important for Step Functions. You can see in this example that that onboarding story is already starting to look a little bit like a workflow in that control plane. That workflow, in fact, is kicked off when somebody comes in and signs up for our SaaS solution. This is what we call onboarding. When a tenant comes in and says, click, I want to start using your service, we need a bunch of different things to happen: we need to sign them up in our tenant management solution, perhaps that's a database; we have to add an initial admin user so that they can log in in the first place; and we have to do some provisioning, because without the application all of that is perfectly useless, right? So we have to go and provision these application planes. And in this simple example, you can see we have two different environments that our tenants are directly using. This is one example of how we can set up SaaS, and already we can see how Step Functions could possibly be useful at the control plane level. There are these different isolation options in SaaS. On the left-hand side, we have something that looks a lot like what I was just showing, right, where it's a complete silo and everybody gets their own stack. On the right-hand side, everyone is sharing the same solution, and this is what we call pooled; in that pooled solution, all of those different services, all the step functions, everything else, are shared by everyone. And in between, we have a bridge solution, where some parts of our solution might be shared and some might be siloed.

Bill Tarr 15:20
I'm going to introduce a tough concept that we're not going to be able to dive into enough today, but I at least want to say this, since we're talking about step functions: if you're going to do a shared solution, on the right-hand side of this diagram, you're going to have to implement runtime isolation. If you have a siloed solution, you naturally have some networking policies that keep those environments from talking to one another, and you have some security policies built around that and IAM. But if we're actually sharing the same infrastructure, if different tenants' data is running through the same step functions, landing in the same Lambdas, running through the same API Gateway, we've got to implement some security at runtime. Usually, that takes the form of a JWT token. And thinking about what this means for step functions: in that shared environment, the tenant context coming through our API Gateway would be produced, perhaps, by our Lambda authorizer, right? That Lambda authorizer takes the JWT token that comes in and translates it into the tenant identity that all of our other services need. We call this the tenant context, and that tenant context for a step function needs to be passed through our entire system. As you look at this diagram, it should become apparent why, right? That tenant context is going to be used in each of those tasks to determine what tenant is actually executing a specific job. It needs to be used by a different task that's writing to a database to enforce tenant isolation, to say, hey, this is Bill coming in right now, only give Bill access to Bill's data in DynamoDB. And also, as we emit metrics and logs out of our solution, those also have to carry tenant context. Because, you know, if Brian is actually using our solution right now, and we're writing logs off, and then Brian has a problem in production, we have to be able to troubleshoot that: come back later, look through a single viewpoint and see all of the different logs that are happening, but then also disaggregate those, look at just the logs that were emitted for Brian, and determine exactly what issues Brian is experiencing. Now, I'm leaving a lot of open questions here. As I said, the shared solution can be a little bit more complex; it's a whole hour talk in and of itself. For now, let's just assume that tenant context is being passed through our system. It can be passed as input parameters inside our step functions. It might be a version of our auth header, that JWT token; we might pass that around. Or, in our API Gateway and Lambda authorizer, we might actually extract that tenant token and just put it into our headers and pass that around. There are a few different options for how we can take a look at that. But for now, I don't want to belabor the point any more about how SaaS and step functions work together. I want to hand it off; we've got some other great speakers coming in here, and I want to let them have the opportunity to show you some of the ways they're implementing step functions in SaaS.
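As an illustration of the pattern Bill describes, here's a minimal sketch of a Lambda authorizer that extracts a tenant identity from a JWT and returns it as context for downstream services. The custom:tenantId claim name is an assumption, and a real implementation must verify the token's signature with a JWT library rather than just decoding it:

```ts
import { APIGatewayTokenAuthorizerEvent, APIGatewayAuthorizerResult } from 'aws-lambda';

export const handler = async (
  event: APIGatewayTokenAuthorizerEvent,
): Promise<APIGatewayAuthorizerResult> => {
  const token = event.authorizationToken.replace(/^Bearer /, '');

  // NOTE: decoding alone is NOT authentication; verify the signature
  // against your identity provider's keys in real code.
  const claims = JSON.parse(Buffer.from(token.split('.')[1], 'base64url').toString());

  return {
    principalId: claims.sub,
    policyDocument: {
      Version: '2012-10-17',
      Statement: [{ Action: 'execute-api:Invoke', Effect: 'Allow', Resource: event.methodArn }],
    },
    // Tenant context flows to the integration (and from there into the step function input).
    context: { tenantId: claims['custom:tenantId'] },
  };
};
```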

Andres Moreno 18:28
Thank you. So, tying into Bill's onboarding scenario, right: sometimes you want to bring in information as part of your onboarding, from another system, or from wherever they stored their information. So I'm going to be talking about how you can use step functions to migrate that data into your SaaS application. A little introduction: my name is Andres Moreno, and I'm a principal software engineer at Tyler Technologies. I've been doing software engineering for over 13 years, with a focus on the cloud and serverless for about six years now. So let's talk a little bit about how this looks, right? There's usually another system, even if that's Excel, right, where you capture your information, and you want to move to a SaaS application. Or you're offering a SaaS application, and you want to offer, as part of the onboarding, that they can bring in their information, whether that looks like Excel, or JSON, or CSV, or a separate database. A recent example where I used something like this is with Mint, right? If you don't know what Mint is, it's a budgeting app that got sunsetted by Intuit. I used it a lot for my budgeting and for looking at my spend and all that, so I was looking for a solution, but I didn't want to lose all the history I had with it. Mint offered an export to CSV. I know the name Copilot is used way too much nowadays, but there's a budgeting app called Copilot that allowed me to take that CSV export from Intuit and import it into Copilot, right? So offering this type of import into your SaaS application, and allowing users to get their information into your application, is super valuable; that's why I went with that application, so I could keep part of my history. So what I want to talk to you about today is several examples where I've migrated data. One is importing data for a customer into our database, we use DynamoDB, from an on-prem database. Then there are situations where your data models or access patterns are changing within your system, and whether you want to lock yourself into that data model is your option, but usually you would want to migrate to a more optimal data model; so that's a DynamoDB-to-DynamoDB migration. And another one that we use pretty heavily is reindexing OpenSearch data from your DynamoDB tables. There are situations where you're capturing some information, but it's not truly searchable yet, and if you then want to search by it, you don't want to lose all the information you already captured. So you've got to trigger a sort of reindex, and make sure you're mapping that as part of your search. We'll go over examples of all of these. So let's go over the first one, right: you've got an on-prem database. Usually a big customer has their own database in their own data center, and it's pretty locked down. What you would have to do in this situation is export that data, either outside of or within their on-prem environment, and transform it into what you need as JSON; in my case, since we're importing into DynamoDB. Once you have that information, you can then go into your AWS account and send it into S3. How can we trigger off of that, right? S3 offers S3 events, and there's a native EventBridge event that you can tie into when an object is created in S3. So we're going to tie into that.
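As a rough sketch of that trigger wiring in CDK, assuming a hypothetical importStateMachine is defined elsewhere in the same stack:

```ts
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as s3 from 'aws-cdk-lib/aws-s3';

// The bucket must emit its notifications to EventBridge.
const importBucket = new s3.Bucket(this, 'ImportBucket', { eventBridgeEnabled: true });

// Start the migration workflow whenever a file lands in the bucket.
new events.Rule(this, 'OnImportFileCreated', {
  eventPattern: {
    source: ['aws.s3'],
    detailType: ['Object Created'],
    detail: { bucket: { name: [importBucket.bucketName] } },
  },
  targets: [new targets.SfnStateMachine(importStateMachine)],
});
```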
So once we dump this information into our S3 bucket, we can now trigger a step function as a target for that EventBridge event. You can see how we can start tying in the data, and then get to the real migration in step functions. So how does this look in a step function? Bill showed a little bit of an example of the JSON and the CDK; this is a screenshot from Workflow Studio, where we're ingesting the data, right, so we're getting an EventBridge event. This is what an EventBridge event looks like. I've trimmed it a bit, but the important pieces are: you have a bucket name and an object key where the data lives. Previously, you would have to load this JSON and then call an inline map. We now have the distributed map here at the top, which is basically a loop that can run multiple concurrent executions at the same time. So there are two types of map. There's the distributed map, which is a lot more scalable; you can run it with up to 10,000 concurrent executions, though you obviously have to protect against rate limiting and throttling on the underlying services. And then there's the inline map, which, as the name says, runs within that execution. The reason I recommend using the distributed map for migrations is that each iteration in the map runs in an isolated execution, allowing you to avoid the two limits you usually hit when you're doing migrations with an inline map: the state transition history limit, which is up to 25,000 transitions, and the state size limit. So by isolating each iteration into its own execution, you can allow for larger migrations without hitting those limits. And one thing we can do with the distributed map is point it at an S3 object where the information is. We already gathered the data, transformed it to JSON, and put it in S3; now we have a JSON object that we can load directly into the distributed map, and it'll handle all the iterations for us. So now we can walk the workflow for each item in our array. The first thing we do, and this is part of what I do to protect against overwriting any existing data, is check whether we've already imported this data into our system. If we find that we have, you can handle it per your use case; in my case, we skip it, and you can add some handling logic there to add a metric, or add a log, or notify someone that we skipped it. But if we don't find it, we then go down the branch where we want to actually add the data to the database. We usually want to validate first, because, well, I at least don't trust my coding abilities, so I end up making a lot of mistakes. I want to validate that the data coming in is actually valid. In my case we're using JSON and I'm using JavaScript, so we can use a package like Ajv, where we provide a schema to validate against; we just give it the data and it tells us whether we're good or not. If we're good, we go and add that item to our DynamoDB table, and we're good to go. If not, we can have some handler logic, again based on your needs: you can either send an email, or add it to a queue, and so on, to be able to handle that invalid data. A lot of what you see here shows that the step function is our piece of compute, and we can directly integrate with services. In our case here, we're directly integrated with DynamoDB, but you can also integrate with many, many other services that they offer.
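A minimal sketch of that validation step, assuming the schema package Andres mentions is Ajv and that the item shape (tenantId, firstName, lastName) is hypothetical:

```ts
import Ajv, { JSONSchemaType } from 'ajv';

interface ImportedItem {
  tenantId: string;
  firstName: string;
  lastName: string;
}

const schema: JSONSchemaType<ImportedItem> = {
  type: 'object',
  properties: {
    tenantId: { type: 'string' },
    firstName: { type: 'string' },
    lastName: { type: 'string' },
  },
  required: ['tenantId', 'firstName', 'lastName'],
  additionalProperties: false,
};

const ajv = new Ajv();
const validate = ajv.compile(schema);

// Lambda handler invoked once per distributed-map item.
export const handler = async (item: unknown) => {
  if (!validate(item)) {
    // Route invalid records to your own handling (queue, alert, metric, ...).
    throw new Error(`Invalid item: ${ajv.errorsText(validate.errors)}`);
  }
  return item; // valid -> the next state writes it to DynamoDB
};
```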
And it doesn't all have to be Lambda functions, which is something that can improve your performance if you want to reduce the load and the compute for long-running migrations. So, jumping into the second example, the one I mentioned going from DynamoDB to DynamoDB. I have two examples for this. We'll go through the first one, where our access patterns have changed and we don't want to stress about it and work around it within our current data model. So we've done a major data model change to be able to optimize our access patterns. How does this one look? We have our DynamoDB table, and all AWS services offer some functionality that might be helpful or not; in DynamoDB, there's a direct export from DynamoDB to S3, and that's a click of a button. It'll get you all your data into S3; it takes a while, but you get it all there. So what we can do in this situation is gather the JSON from DynamoDB, and it'll get into S3 automatically, with built-in functionality from AWS. Once it hits S3, we can follow the same process we did before, where our S3 bucket triggers an EventBridge event, which triggers a step function for us.

Andres Moreno 28:07
How does this step function look? You can see the pattern is very similar. If you look at the middle, we have the same steps that we had in our previous example, so you can see you can reuse these patterns for different types of migrations that you're running. But we have a couple of steps that we've added here. The first step is to process a manifest file. When you export from DynamoDB, it's going to actually export the data, but it gives you a manifest of where the data lives, because they are kind enough to partition, or split, the data into different files. So if you have 20 million items in DynamoDB, you're not going to get a single file with all of them, and that allows us to parallelize a little bit more and handle the load. But the problem is that the manifest is a list of object keys, or rather a list of JSON objects with object keys; it's not a JSON array, which is what we can feed into the map. So we need to process it a little bit to get our JSON array to feed our map. In this case, I'm using an inline map to loop through these files, these object keys, so that we can then give each object key, where the data actually lives, to our distributed map. Once we're there, we follow the same process: the distributed map will load the actual JSON and validate whether the item is already there. If it's not, in this case we'll transform it inline here, not like we did in the first example, validate the data, and add our item to the database. And here it depends on how your organization functions and what liberties you have, but there's another way we can do this, by handling everything within the same step function. So let's go through the version two I have for this, where we're actually exporting the database we want to migrate from within our step function. Here at the top, we see a few more steps, where instead of me having to go into the console, click a button to export, and then trigger off the S3 event, I can do it all here. So we trigger this export-table-to-point-in-time. You obviously need to have point-in-time recovery turned on for your database table to be able to accomplish this, and I highly recommend that for any of your databases. Now, this is an async process, and there's no event from DynamoDB to let us know that it's done. So what we do is trigger it, wait for a bit, and then describe the export; this is another action they provide in their APIs. We describe the export to get the status, whether it's complete or not. If it's not complete, we go back, wait again, and loop until it's complete. Then we follow the same process: once it's complete, we process our manifest file, iterate through the object keys, hand it over to the distributed map, and do the same steps that we had done before. So you can start seeing how you can mix and match, right? AWS has a set of services that function as building blocks for us, so you can start manipulating these and providing more metrics and more reliability based on your needs, as you put more of the building blocks within the step function.
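A rough CDK sketch of that export-and-poll loop, assuming a table and exportBucket defined elsewhere; the wait time and IAM scoping are illustrative, not prescriptive, and the export also needs S3 write permissions granted separately:

```ts
import { Duration } from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// Kick off the table export (point-in-time recovery must be enabled).
const startExport = new tasks.CallAwsService(this, 'StartExport', {
  service: 'dynamodb',
  action: 'exportTableToPointInTime',
  parameters: { TableArn: table.tableArn, S3Bucket: exportBucket.bucketName },
  iamResources: [table.tableArn],
  resultPath: '$.export',
});

// The export is async, so wait, then ask DynamoDB for its status.
const waitABit = new sfn.Wait(this, 'WaitForExport', {
  time: sfn.WaitTime.duration(Duration.seconds(60)),
});
const describeExport = new tasks.CallAwsService(this, 'DescribeExport', {
  service: 'dynamodb',
  action: 'describeExport',
  parameters: { 'ExportArn.$': '$.export.ExportDescription.ExportArn' },
  iamResources: [`${table.tableArn}/export/*`],
  resultPath: '$.status',
});

// Placeholder for the manifest-processing and distributed-map states.
const processManifest = new sfn.Pass(this, 'ProcessManifest');

// Loop back to the wait state until the export completes.
const checkComplete = new sfn.Choice(this, 'ExportComplete?')
  .when(
    sfn.Condition.stringEquals('$.status.ExportDescription.ExportStatus', 'COMPLETED'),
    processManifest,
  )
  .otherwise(waitABit);

startExport.next(waitABit);
waitABit.next(describeExport).next(checkComplete);
```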
I do have a third scenario; this one we use quite a bit in my company. Let's say we have a name, and we've only indexed first name and last name, and that's what we search on. But we've always captured middle name, and we haven't made it searchable. And now we want to. So we want to be able to take all that information that we already have and index that middle name so that we can now search on it. I believe it's a pretty common use case that you see once you provide something like search. So again, building on top of the same pattern that we've seen all along, we're exporting our table to get all the information. In this situation, yes, you could do it with a DynamoDB scan and all that, but you run the risk of hitting throttling limits in DynamoDB, and there are things you'd have to take care of. Once the export is complete, we process our manifest. In this situation, what we're doing is creating a new index that we're going to use for the new search, and we're iterating through the data. You see a lot fewer steps within this map, because we're handling all of it within a Lambda function, but we're looping through all the data, and now we have this new function that indexes the data as we need it for our new search. So as we're going through, we're reindexing the data into our new index. And once we're done, kind of like you'd point a DNS route at a new application, you point the index, or your search, at the new index, and now you can search on that middle name that you couldn't search on previously. There are a lot of protection measures you can add to these step functions, and a lot of things you can do with SQS queues to reduce the amount of downtime you have with these migration patterns. Definitely something I could talk further about, but it takes more than a lightning talk to get into the weeds of all that. So I just wanted to provide a few examples of how you could migrate data from different sources, and get your gears turning to see what ideas you can come up with for your own migrations that follow this similar pattern. So thanks a lot, and I'll hand it over to the next speaker. Thank you. Awesome.

Jason Wadsworth 34:17
Thank you. I know we're here to talk about Step Functions, but I'm going to start things off by talking a little bit about AppSync. AppSync, for those of you who don't know, is a GraphQL service on AWS. It's actually a really great service with a lot of power to it. But one of the things that I really like about AppSync is what they call direct integrations. As you can see here in this example, it's AppSync connecting directly to DynamoDB. The nice thing about this is it removes the compute layer, right? I don't have to introduce a Lambda in the middle to ultimately make that call to DynamoDB. And there are several advantages to this: it reduces some costs, because I don't have a Lambda invocation; it reduces some complexity, because I don't have to worry about code deployment or updating and patching code; and it can actually be better performance, right? I don't have cold starts, for one, or any of the other things related to Lambda that are just going to slow this down. So, a lot of really great advantages to this. As somebody who builds multi-tenant SaaS apps, mostly in the pooled model that Bill talked about earlier, this is something that I look at with a lot of envy, because I can't really use it. And the reason I can't use it is, as Bill pointed out earlier, I want to keep my isolation as good as possible, right? I want to make sure that those connections to DynamoDB have the right permissions, so that the only data I can get is for the tenant that's being requested. And that's not possible with this integration, because AppSync has complete permissions to all the data in DynamoDB. So this is the model that we typically use, and again, it's very similar to what Bill was showing you earlier, but I'll point out a few differences; there are a few details here that Bill didn't have time to talk about. We have a custom authorizer, and that custom authorizer is doing things like validating a JWT. But ultimately, the key thing for us, in the model that we're using here, is that we actually make a call into STS to get credentials for that particular tenant. So if this is a request coming in for Jason, the credentials that I'm going to get from STS allow me access to only Jason's data. We take that credential, we pass it back through AppSync, AppSync passes it along to the Lambda function, and ultimately this Lambda function is going to use those credentials to talk to DynamoDB. We don't give the Lambda function permission to talk to DynamoDB itself; it has to use these credentials. Now, this is really great as far as data isolation is concerned in a pooled environment, but of course, the thing we lost was the ability to do those direct integrations. So, you know, something I've always been looking for is: is there a solution for this? This is where Step Functions comes into play. Last year, the Step Functions team introduced this concept of cross-account tasks in Step Functions. As we talked a little bit about Step Functions and their ability to do orchestration: everything within Step Functions before this pretty much had to be within the same account. There are exceptions to that; there are some things in AWS that have resource policies that allow some cross-account capabilities, and interestingly, DynamoDB just added that last week. There are some pros and cons to that approach.
But with this capability that the team added last year, you can do this on anything, right? Effectively, when I want to make a call to something in a different account, I give it a role and say, use this role. It will assume that role, get credentials with that role, and use those credentials to do whatever task it needs to do in the other account. So again, this is a really great feature with a lot of capabilities. As a matter of fact, from an onboarding perspective, there's a lot you can do here if you're doing multi-account orchestration; some really great capabilities there. But as I was looking at this, I thought, you know, there's nothing about this that says I have to go cross-account, right? At the end of the day, it's just a role that I'm assuming. Can I take advantage of that to give me what I was looking for with direct integrations with AppSync? And there kind of is a way. So this is what I came up with, and the solution that I'm actually working with today. What we do here is slightly different from what we had before. We still have this custom authorizer, and before you jump in and say, well, you're still having to call Lambda all the time: that custom authorizer isn't called on every request, or at least doesn't have to be. You can cache it. It's based on the authorization header, and it'll be called when the authorization header changes, or whenever you hit a timeout; we typically have ours time out at like five or ten, maybe fifteen minutes. So that custom authorizer is only called infrequently as far as individual requests are concerned.

Jason Wadsworth 39:20
But what we do differently here is we don't actually make a call into STS anymore. We're not getting credentials; we're just validating the user and getting information about what role we want to assume. Now, I originally thought, you know, I can just have a role that's named something with the tenant's identifier in it, and then I can just create that string anywhere in the step function and we'll be good. I decided not to do that, because the risk is that I could build that string incorrectly, right? Somewhere along the path, I could have something in the code in my step function that builds the string based on the wrong tenant, and it would still work. But by virtue of having this custom authorizer go out and find what role it is, I can pass that role information into the step function. The step function now uses the information that was passed into it to assume a role; it's going to work just like it did with the cross-account stuff. It just takes whatever role you tell it to, assumes it, and then uses those credentials. And so I'm able to use those credentials to talk to DynamoDB, again all with the protection of only being able to access data for the tenant that I'm working with, instead of being able to do it for every tenant. So by virtue of this, I'm able to step back and say, all right, I have this direct integration with AppSync talking to DynamoDB, and I no longer have to have code. Now, I'll tell you up front, we've lost one of the advantages of direct integration, and that is cost, because Step Functions Express Workflows are, generally speaking, similar in cost to a Lambda invocation. The exact costs will vary depending on exactly what you're doing in those step functions; it's not one-to-one, but it's usually not a significant reduction in cost. But we do gain an advantage. With direct integrations, it's a little bit harder to add logic, or there's only so much logic you can do there, whereas this allows you to add whatever amount of logic you want within the step function: maybe some if-logic in there, some conditions, to make your workflow do different things based on the data being passed back. So there are some advantages to it. That said, there are some things to be aware of. First and foremost is that every tenant in this model has to have its own role, and this is a pretty significant thing. If you think back to the original example, when we had that custom authorizer going to get credentials for us, what we typically do is have a single role that is used, and then a dynamic policy that we assign at the time of assuming the role. So I only have one role to manage, plus some code that applies a policy. In this world, I can't do that, because Step Functions requires a role; it doesn't allow me to attach a policy on top of it. So every tenant has to have its own role. Now, there are two significant things to point out with that. There are hard limits in AWS on the number of roles you can have in an account; I believe that number is 5,000 today. Don't quote me on that, but I think it's something like that. So you've got to keep that in mind, right?
So if you have a really large tenant base, you need to be aware of that limit if you're going to go with something like this. That doesn't mean you can't do it; it means you need to have a multi-account strategy in place. I think I've actually heard Bill talk in the past about this concept of tenant sharding, being able to spread your tenants out across multiple accounts in AWS. There's a lot you can do there; I'm actually working on some stuff in that space, too, which will be kind of interesting to talk about at some point. But there's a lot you can do to still manage this and still use a technique like this with a large number of accounts. It's just something to be aware of. The other thing that's really important to keep in mind is that if you need to make changes to those roles, instead of having one role to manage, you now have as many roles as you have tenants, right? Before, like I said, I had one role, and if I needed to make a change, I made a change to that role, changed a small piece of code, deployed those two things, and everything was good. In this model, you have to make sure that you have some means of updating that role across all your tenants. Now, interestingly, step functions can come into play there as well; I'm working on some examples of that, and there are some examples I think the SaaS Factory has put together that do some of that too. It's really just about having some means of managing tenant infrastructure across your entire user base, and these roles need to be a part of that. One other downside to this is that the DynamoDB mapping isn't quite as simple as AppSync's direct mapping. When AppSync makes a call into DynamoDB and gets the data back, it kind of hides the DynamoDB-ness of that data. What I mean by that is, if you're familiar with the way DynamoDB actually stores its data, it's not just the raw JSON that you might work with in JavaScript code or any code; it actually goes down to the type of the data. So if there's a first name, it's not just first name, it's first name dot S; it's not just last name, it's last name dot S. That whole structure, the direct integration with AppSync hides from you automatically; it unmarshalls it, is what they call it, and turns it into kind of that original JSON. You don't get that same capability with this option, unfortunately, and there aren't utilities for that that exist today. I'd love to see those added, but they're not there today. So the downside is that every field you add does need explicit mapping. How big of a downside that is depends on how often your data model changes, right? If it's not changing very often, it's not a really big issue; but if it is something that changes, then something to consider is whether or not you want to go with this option. But at the end of the day, it's something that actually works, and it's pretty cool. I've been playing around with it; we haven't actually rolled this out yet, largely because I'm working on some stuff around the change management of those roles. Once we get that in place, we will start using this. But it's a cool option to give us some capabilities to, again, reduce some cold starts, as well as cut down some code that I have to deploy and worry about maintaining.
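A minimal CDK sketch of the per-tenant role assumption Jason describes, assuming the authorizer has already placed a hypothetical $.tenantRoleArn (and $.tenantId) into the state input, and that table and the pk key attribute are placeholders:

```ts
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// Read the tenant's item using credentials from the tenant-scoped role,
// not the state machine's own execution role.
const getTenantItem = new tasks.DynamoGetItem(this, 'GetTenantItem', {
  table,
  key: {
    pk: tasks.DynamoAttributeValue.fromString(sfn.JsonPath.stringAt('$.tenantId')),
  },
  // Step Functions assumes whatever role ARN the input carries, exactly as
  // with the cross-account task feature.
  credentials: { role: sfn.TaskRole.fromRoleArnJsonPath('$.tenantRoleArn') },
});
```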
So, some capabilities that are really nice. I would love to see AppSync take this ability directly into AppSync, but until it does, Step Functions is a great way to get us there. So that's all I have. Thank you so much; there's some information about me, Jason Wadsworth, on screen, and, again, thanks for having me.

Seth Geoghegan 46:04
Thanks, Brian. Hi, everyone. I'm Seth Geoghegan. I am a software developer at Nerdy, and today I'm going to talk to you about the SaaS onboarding piece of our journey, I guess I would call it: SaaS onboarding, from manual to automated, using Step Functions. I think Bill and Andres did a really fantastic job of teeing up some core concepts that I'm going to refer to in my talk, so it's really cool to see what everybody's building, and thanks for setting that foundation. I want to start by describing how we began this journey, right? And maybe this resonates with some people on this talk: rarely do we start with the total end solution fully figured out, right? In the beginning of our application domain, we had our application plane. This was years ago, by the way, where we had our web tier and all of our little microservices kind of floating around in our application plane; this is what Bill referred to as the pooled architecture, right? And we had something like a monolith application. There was a typical B2C, business-to-consumer, use case that provided a self-service sign-up, and the application plane was responsible for managing that onboarding user experience, right? It hadn't been teased out yet into what we in the SaaS community call the control plane. That was the beginning. But then we started growing as a company, and we had some new business opportunities and new customer opportunities that kind of forced the issue of how to grow and scale the onboarding experience. In this business-to-consumer model, we just had single users coming to our application and signing up; now we have businesses coming to sign up, and that means provisioning can be thousands of users at any given time in the onboarding experience for businesses, as our tenant definition changed. So we were faced with this concern of: how do we scale our self-service onboarding story to the B2B use case, right, for the enterprise concern? Or, more simply stated, how do we do that in bulk, when our customers have thousands, tens of thousands, or more users to provision in our system? And I think this is where, as I was saying, Bill teed this concept up nicely: what we were identifying is that we need this concept of a control plane, right? Bill covered it in greater depth earlier, but the idea is that there's this common set of services, or concerns, that we can start teasing out of our application plane, and establish a home for all of those shared concerns that every SaaS application needs to deliver. Now, in our case, we're talking about an admin experience, right? This is an internal tool; our business customers are not interacting with it directly. It's our own internal tooling that we built out to facilitate that onboarding story. And there are all kinds of concerns in the control plane: there's identity management, there's the onboarding picture, there's tenant management, and you have an admin experience. But today, what I want to spend this time talking about is specifically how we use Step Functions to implement that onboarding experience.
And we had a few high-level goals here for this use case. We wanted to be able to provision users in bulk, right? Our B2C model was just a single user, but now we have to do that at a larger scale, and we need to do so by integrating with SSO providers. We wanted to be able to detect new customers automatically: these business customers would go into their SSO providers and add our app as a trusted application, and we wanted to detect that automatically and kick off our internal processes for provisioning. Another challenge is that we wanted to keep that data in sync, right? When customers manage their user management system and add and remove users, we wanted to make sure we mirrored that in our environment, not necessarily in real time, but in an automated fashion. And finally, of course, like any good onboarding experience, we wanted a nice admin user interface for our internal users. So that's what our goals were, and the path and journey there isn't always so immediate or straightforward, right? Step one in teasing out these concerns was very unlike where we ended up with Step Functions: it was implemented with AWS Batch jobs. And this is an audience of serverless and Step Functions folks, so why am I talking about AWS Batch jobs? That was the first stab at teasing out that concern of onboarding and achieving some of those goals I mentioned earlier. Fetching data from these third-party SSO providers was a long-running process as it was initially implemented, so batch jobs seemed like a good fit at that time. But this suffered from some drawbacks, right? There's a human orchestrator, this operator that's running the onboarding experience, manually executing these batch jobs, making sure they happen in the right order, monitoring them and watching them. But, you know, honestly, it worked right out the door. Version zero, as I call it, which was quite a while ago at this point, checked the box: we were able to onboard these customers in bulk, pulling down all these users and provisioning all the resources in our application plane as we needed. But it was very manual, right? We needed to keep things in sync, so the human operator had to run this again, and again, and again. It worked fine when the number of customers we had was small, but as we scaled, it was just not a workable solution. So now is when we get into what we did. Version one of this introduced a Step Functions workflow, and this was the early phase of us getting our feet wet using Step Functions in the user onboarding, or tenant onboarding, experience. It was kind of a lift and shift, honestly: we took these batch jobs that humans operated, moved them into a step function workflow, and orchestrated them in the correct order they needed to run. Very similar to Andres's talk, the way we kicked this off was an S3 bucket: upload your JSON, and then you could churn through that in your step function workflow and do the actual onboarding process. So this was an incremental improvement, but it actually delivered a lot of value, right? No longer was a human in control of operating this onboarding process.
We had this in the step function, and that orchestration happened for us. But it was still high friction, right? This was not the ideal state; this was the "let's move this over to get it into an automated state, so we can actually build out what we would prefer to have longer term." And, by the way, this is what the actual step function workflow in v1 looked like. It's not terribly fancy, right? It was kicked off by an S3 bucket object-created event through EventBridge, as Andres was mentioning in his talk. We take a configuration at the top of the workflow, parse it, and then pass it on to each batch job, which did its own piece of the onboarding puzzle. But now we were set up to actually evolve this solution, and able to properly think out what we wanted this end-to-end experience to look like for our administrative users. And so there are a lot of services here; I hope this is big enough for you all to see. Now that we got to the point where we could automate that experience and free up our time to focus on building a better end-to-end solution, we came up with this idea. We have an admin console that lets administrators configure and manage tenant registration through an AppSync API, and we introduced EventBridge Scheduler to take care of the periodic syncing and re-syncing. But the real key I want to point out to you all is how we use Step Functions in this workflow. The experience is: you go to the administrative interface, you log in, and you configure a tenant; that configuration has a schedule related to it, in terms of how often the data is synced, the credentials for the SSO provider, and so forth. So this is what we're building, and I'll take you through some of the step function workflows that facilitate this process.

Seth Geoghegan 55:07
Now, one of the goals that we had is that we wanted to be able to automate the discovery of new customers, right? Because, as I said, our application is available in various marketplaces. Somebody would add our application, and we would get a notification via email, so a human being had to take that and go do something to kick off our onboarding process when it was more manual. But we thought, gee, it would be really great if we could automate the discovery part of this process, and the step function workflow that I'm showing on the screen was a really, really simple way to do this. We have multiple SSO providers, and I'm choosing just one of them as an example, but they do expose an API that lets us discover which customers have granted us access on their platform. The way the step function workflow works is, up front, at the top, we fetch from that API the list of all the customers we have access to, right, who's given us access. We also, at the same time, fetch the last time this workflow ran from a Parameter Store value; it's kind of a little hack for us. We didn't need a full DynamoDB table to store this information; Parameter Store worked fine for this use case. Then we use this map state to iterate over the results of the request to this third-party API, and as we iterate through, we can check a timestamp, basically, for when that customer, or rather the permission, was added to our account, so that we have access to that data. When we find something new, we publish an event to EventBridge; and you see that I'm just calling it "old" on the other side, a Pass state, basically skipping it when it is not a new provider. So this is a really simple step function workflow, but it's got a few key things I wanted to point out. One, it's using the new call-third-party-API task from Step Functions, which, if you haven't used it, is a very cool feature, where you can push off the management of the API call, and the management of the actual credentials used for that call, to AWS. The step function workflow just manages that for you, takes the output, and passes it on to your next state. That was a really key aha moment for us, because a lot of our integrations are with third-party providers, and we were able to leverage the call-third-party-API task in several of our workloads. Also, we're big fans of serverless-first, so being able to use Parameter Store and have a no-code workflow that identifies when customers give us permission to interact with their account and pull down their data really set us up for a nice automation of the onboarding experience, which is what I'll show you next. When we discover a new customer in our environment, we do an initial sync of their data. I've been calling them businesses or enterprises, but in our domain, we're in education; our customers are school districts and schools, and the kind of information we're pulling down is student account information, teachers, which schools are in the district, that sort of thing.
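A rough CDK sketch of that third-party HTTP task, with a hypothetical provider URL, endpoint path, and API-key secret name; the credentials live in an EventBridge Connection, which is how Step Functions manages them for you:

```ts
import { SecretValue } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';

// The Connection holds the provider credentials; Step Functions
// attaches them to the outgoing request at run time.
const connection = new events.Connection(this, 'SsoProviderConnection', {
  authorization: events.Authorization.apiKey(
    'x-api-key',
    SecretValue.secretsManager('sso-provider-api-key'),
  ),
});

// No-code HTTP call to the provider's "who has granted us access" API.
const listCustomers = new tasks.HttpInvoke(this, 'ListAuthorizedCustomers', {
  apiRoot: 'https://api.example-sso-provider.com', // hypothetical provider URL
  apiEndpoint: sfn.TaskInput.fromText('v1/customers'),
  method: sfn.TaskInput.fromText('GET'),
  connection,
});
```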
And we use a step function workflow to orchestrate the calls to these providers, and there are many providers, so many workflows, to pull that data down and provision those resources in our environment, so those customers can log in and have the SaaS experience that we provide. A similar pattern here from the last screen is that we're using this call-third-party-API task, and you're going to see it all over the place in the workflows I'm presenting. The very first step in this workflow, and I'm zoomed in to just a very tiny slice, I'll show you the bigger workflow in a moment on the next screen, is to fetch the credentials from the provider to identify this particular customer. It's done via a token: our application has permission to reach into that API and fetch credentials specific to that customer. The next thing we do is enter a parallel state, where we use Lambda functions (I can get into why in a moment), and the Lambda function just does the API request to the provider, handles the pagination, and writes the response to S3 buckets. And again, Andres's talk really teed this up nicely for me, because Step Functions makes iterating over JSON arrays so simple. You can see the output of a Lambda function is actually a bucket name and object key that we just hand to a distributed map state, map over the response from the API, and put into DynamoDB. And so this full step function workflow could be used to onboard an entire customer base, and it could be triggered by the prior workflow, when we detect that some company has granted our application permission. And here's what the broader workflow looks like; hopefully it's not too much for one screen. This is what the general process looks like for the initial data sync of our customers when they join our platform. And this logic here, although I hope it looks simple, represented an entire AWS Batch job that had a little bit more management overhead, and the observability picture looked a little bit different. This is a very modest workflow for the value we're getting out of it, in terms of automating that full end-to-end onboarding experience. To give you a sense of the volume of data this is processing: these types of customers, depending on the number of schools in a district and the number of students in that district, could result in tens of thousands to hundreds of thousands of entries in our database when we onboard these customers. But the key component of that is the user onboarding experience, because it lets us provision downstream resources for those users. And lastly, I wanted to show you one of the goals that we had: how do we keep this data up to date? Because one of the problems we had to solve for was that our customers maintain their own data source; they call them student information systems, which have their school rosters, which students are in which grades, that sort of thing. And that changes all the time. So we needed a way to make sure that our data stayed up to date and reflected that source of truth on the customer side.
And the way we answered that in v0 of the solution was just manually re-executing this on a periodic basis; for some customers quarterly was fine, for some daily, but we would certainly need to do it at least on the school-year boundary. We thought it would be great if we could write a step function workflow to automate this piece. Luckily, this particular provider had an events API. It's a polling mechanism, so we have to poll it, but we're able to get basically an event stream of the changes that have happened over time, and that teed things up nicely for a step function workflow. I'm showing you one workflow that updates the state of schools. You can think of it as a Change Data Capture stream of events that we pull down from a REST API, and we have this type of workflow for each type of entity that we manage or onboard into our system. Again, you see the call-third-party-API integration at the top, where we're getting a stream of events from this provider. We iterate through those events, which are ordered by time, and apply them to our system in order. In this case you're either updating the school's configuration or deleting it, because it was removed on the provider side. So again, a very simple no-code workflow that solves a key problem in our onboarding story.
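One more hedged CDK sketch of the shape Seth describes (event field names, endpoint, and table schema are hypothetical): poll the events API with the HTTP task, then apply each time-ordered change one at a time, either upserting or deleting the school record:

```typescript
import * as cdk from 'aws-cdk-lib';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as events from 'aws-cdk-lib/aws-events';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import { Construct } from 'constructs';

export class SchoolEventsSyncStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const table = new dynamodb.Table(this, 'SchoolTable', {
      partitionKey: { name: 'schoolId', type: dynamodb.AttributeType.STRING },
    });

    const connection = new events.Connection(this, 'ProviderConnection', {
      authorization: events.Authorization.apiKey(
        'x-api-key',
        cdk.SecretValue.secretsManager('provider/api-key'), // assumed secret name
      ),
    });

    // Poll the provider's events API for changes since the last sync.
    const getEvents = new tasks.HttpInvoke(this, 'Get school events', {
      apiRoot: 'https://api.example-provider.com',
      apiEndpoint: sfn.TaskInput.fromText('v1/events/schools'),
      method: sfn.TaskInput.fromText('GET'),
      connection,
      resultSelector: { events: sfn.JsonPath.objectAt('$.ResponseBody') },
    });

    // Apply each change: upsert the school's record, or delete it if the
    // provider removed it.
    const upsertSchool = new tasks.DynamoPutItem(this, 'Update school config', {
      table,
      item: {
        schoolId: tasks.DynamoAttributeValue.fromString(sfn.JsonPath.stringAt('$.schoolId')),
        name: tasks.DynamoAttributeValue.fromString(sfn.JsonPath.stringAt('$.name')),
      },
    });
    const deleteSchool = new tasks.DynamoDeleteItem(this, 'Delete school', {
      table,
      key: {
        schoolId: tasks.DynamoAttributeValue.fromString(sfn.JsonPath.stringAt('$.schoolId')),
      },
    });

    const applyEvent = new sfn.Choice(this, 'Event type?')
      .when(sfn.Condition.stringEquals('$.type', 'school.deleted'), deleteSchool)
      .otherwise(upsertSchool);

    // maxConcurrency of 1 preserves the time ordering of the event stream.
    const forEachEvent = new sfn.Map(this, 'For each event', {
      itemsPath: '$.events',
      maxConcurrency: 1,
    }).itemProcessor(applyEvent);

    new sfn.StateMachine(this, 'SchoolEventsStateMachine', {
      definitionBody: sfn.DefinitionBody.fromChainable(getEvents.next(forEachEvent)),
    });
  }
}
```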

Seth Geoghegan 1:03:30
So, just to take a step back and wrap up what we achieved in this migration: my intention was to show you that this SaaS journey really is a journey. You start in one place and slowly evolve your solutions over time. For us, introducing step functions into the onboarding experience in our control plane really reduced the manual operations, that high manual burden on the engineers in our organization who operated the early version of the solution. And one of my favorite things as a serverless practitioner, and maybe some of the people on the call feel the same, is that it reduced code. That is the best feeling ever, when you get to delete something you no longer need to manage yourself. Shout-out to the Step Functions team for the call-third-party-HTTP-API integration; very cool. My last slide, which is next, is a shout-out to that particular feature and what it meant to us. Also the native SDK integrations, and the JSON and CSV parsing through S3 buckets: all of that functionality is built into step function workflows, so we just don't need to manage it, and that was a huge win for us. I think this one is maybe overlooked, but the improved observability too: we can go into any one of the onboarding processes and understand exactly what happened at each step of the integration, with failure handling and fault tolerance built in. That's just a huge win for us. And honestly, the developer experience. As Bill said earlier, if you haven't experienced Step Functions recently, give it another look, not only on the provisioning side with CDK, but the new test state and the new HTTP integrations with third-party APIs. They really set us up for some fast ephemeral deployments, which we do in our CI/CD pipelines, versus our existing v0 solution, which was more of a container workload with a little more friction. So that's what we were able to achieve with these orchestrations. One last shout-out: if you haven't seen Step Functions in a while, this is one of the things that really dramatically improved the experience for us, and it's the call-third-party-API task plus the new test state functionality. This was awesome. I have a little snippet there that says "get school events"; that's the API call we were making to this provider, all configured, with credentials in Secrets Manager. I can just highlight that task, click the Test state button (you see a screenshot of what that looks like), and it executes the request and puts the response back in the window, where you can explore it and look at all the attributes sent along with the request. Just such a cool experience. So a special shout-out to that new, or new-ish, feature; it was one of the things that unlocked our ability to remove a lot of code and really lean in to the step function solution for this onboarding use case. And that's it for me.
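The test state feature Seth describes in the console is also available programmatically through the TestState API. A minimal sketch with the AWS SDK for JavaScript (v3) might look like this; the state definition, connection ARN, and role ARN are hypothetical placeholders:

```typescript
import { SFNClient, TestStateCommand } from '@aws-sdk/client-sfn';

// Exercise a single HTTP Task in isolation, outside any execution.
async function testGetSchoolEvents(): Promise<void> {
  const client = new SFNClient({});

  const response = await client.send(new TestStateCommand({
    definition: JSON.stringify({
      Type: 'Task',
      Resource: 'arn:aws:states:::http:invoke',
      Parameters: {
        ApiEndpoint: 'https://api.example-provider.com/v1/events/schools',
        Method: 'GET',
        Authentication: {
          // Hypothetical EventBridge connection that holds the provider credentials.
          ConnectionArn: 'arn:aws:events:us-east-1:123456789012:connection/provider/uuid',
        },
      },
      End: true,
    }),
    roleArn: 'arn:aws:iam::123456789012:role/StepFunctionsTestRole',
    input: JSON.stringify({}),
  }));

  // status reports SUCCEEDED or FAILED; output holds the state's result,
  // much like the response panel Seth shows in the console screenshot.
  console.log(response.status, response.output);
}

testGetSchoolEvents().catch(console.error);
```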

Brian Rinaldi 1:06:38
Awesome. Okay. So I know we're technically over time, but watching you all, I'm amazed at the different types of solutions here. Each of you is solving very different problems; similar in some respects, with some overlap, but each one unique. It's amazing how flexible this ends up being, to meet all of your use cases. So, now that you've all seen each other's presentations, I'd love to hear everybody's final thoughts. Was there something in particular in somebody else's presentation where you thought, oh, that really stood out to me? Bill is eager to go. And he's muted, too.

Bill Tarr 1:07:40
It happens once a day, no matter what; I've managed to make starting to talk on mute my new mission in life. One of the last things Seth said really jumped out at me, and I'm kind of ashamed for not saying it earlier. Those HTTP APIs they've added to Step Functions, the ability to call third-party APIs, are a game changer for a lot of people who are building SaaS. If you think about all of your different flows in SaaS, you might be using third parties like Stripe, you might be using LaunchDarkly; there are so many interesting APIs. Netlify was one of your sponsors there, Brian, and they're another one with very interesting APIs that you could call out to as you think about your product and how you build your application out. So it's a game changer. When I polled the community for the most popular SaaS-related announcement at re:Invent, hands down, the HTTPS APIs won the day across the board. So many people are playing with them right now. Some simple things can just make your life so much easier, and this is a perfect example of how a simple feature release, which I almost didn't notice at first (it was like, oh cool, I'm sure that's fine), turned out to be a big, big deal once everyone started talking about it. This is going to change things for a lot of people. So again, if you haven't played with Step Functions in a while, they've changed a lot. That team has really been killing it over the last year, as far as I can tell.

Jason Wadsworth 1:09:05
Yeah, to expand on that: like you said, the team is really killing it. They've done so many things to reduce the amount of code you have to write. Even before this, all the SDK integrations, that's huge, right? Being able to call directly into services where, every time before, I had to write a Lambda function to do it. They had a few covered already, a DynamoDB integration, a few integrations that were kind of native, whatever they call them. But once they introduced the SDK integrations, man, things took off. Overnight you had a hundred different possibilities, things you had to use code for in the past that you no longer did. And this HTTP API thing is just another step in that direction: one less place I have to write code. I get retry logic, I get all those kinds of things that would otherwise have to be done elsewhere, in a much more complicated way. It's just done for me.
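To make the Lambda-replacement point concrete, a hedged CDK sketch of one such SDK integration (the bucket name and input shape are hypothetical):

```typescript
import * as cdk from 'aws-cdk-lib';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import { Construct } from 'constructs';

export class SdkIntegrationExample extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // Fetch an S3 object's tags directly from the state machine.
    // Before the SDK integrations, a call like this meant writing,
    // deploying, and maintaining a Lambda function.
    const getTags = new tasks.CallAwsService(this, 'Get object tagging', {
      service: 's3',
      action: 'getObjectTagging',
      parameters: {
        Bucket: 'my-example-bucket', // hypothetical bucket
        Key: sfn.JsonPath.stringAt('$.objectKey'),
      },
      iamResources: ['arn:aws:s3:::my-example-bucket/*'],
    });

    new sfn.StateMachine(this, 'SdkExampleStateMachine', {
      definitionBody: sfn.DefinitionBody.fromChainable(getTags),
    });
  }
}
```

Retries and error handling can then be attached declaratively to the task state itself, which is the point Jason makes about getting that logic for free.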

Seth Geoghegan 1:10:00
Yeah, it's a really compelling offering. The more features that get added to it, the more the developer experience changes. Like I said, that test state feature was worth a call-out as a final slide, at the very least, in case people aren't aware of it. Being able to go in, highlight a single state in your workflow, and invoke it outside the context of your full workflow is another one of those small features that's a game changer. And that's true whether it's a Lambda invocation, an API Gateway integration, or whatever it is. Just the ability to hone in on one simple task really changed the experience for us and, honestly, pushed us further in the direction of a step function workflow for some of these workloads. It made it a no-brainer.

Brian Rinaldi 1:10:48
That's cool. I mean, it's amazing to me: as software companies, we tend to make these big releases with big-bang features, and with the big features, once you dig in, you're sometimes like, okay, that's all right, I might use that once in a while. But these little things make such a big difference, because they're the things you're doing all the time. Calling out to HTTP APIs is something you're going to do on an almost daily basis, right? And that little change suddenly makes a huge impact. Okay, so I'm going to leave it there.
