ec2: Enhance security group handling in Spot Fleet launch specifications#159
Conversation
| for g in (nic.get("Groups") or []) | ||
| ] | ||
| or [] | ||
| ), |
There was a problem hiding this comment.
Does AWS follow this exact precedence order?
There was a problem hiding this comment.
AWS disallows specifying all these parameters on the creation.
| @staticmethod | ||
| def _get_ami_id(ec2_client): | ||
| """Mock AMI in moto; real Amazon Linux AMI via SSM on AWS.""" | ||
| from tests import allow_aws_request | ||
|
|
||
| if not allow_aws_request(): | ||
| return EXAMPLE_AMI_ID | ||
| ssm = boto3.client("ssm", region_name=ec2_client.meta.region_name) | ||
| kernel_61 = "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64" | ||
| return ssm.get_parameter(Name=kernel_61)["Parameter"]["Value"] | ||
|
|
||
| @staticmethod | ||
| def _create_vpc_subnet_sg(ec2_client, cleanups, sg_description="fleet sg test"): | ||
| """Create VPC / subnet / SG, registering cleanups in creation order. | ||
|
|
||
| Reversed teardown (SG → subnet → VPC) respects AWS dependency rules. | ||
|
|
||
| Returns: | ||
| (vpc_id, subnet_id, sg_id) | ||
| """ | ||
| vpc_id = ec2_client.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"] | ||
| cleanups.append(lambda: ec2_client.delete_vpc(VpcId=vpc_id)) | ||
|
|
||
| subnet_id = ec2_client.create_subnet( | ||
| VpcId=vpc_id, | ||
| CidrBlock="10.0.0.0/24", | ||
| AvailabilityZone=ec2_client.meta.region_name + "a", | ||
| )["Subnet"]["SubnetId"] | ||
| cleanups.append(lambda: ec2_client.delete_subnet(SubnetId=subnet_id)) | ||
|
|
||
| sg_id = ec2_client.create_security_group( | ||
| GroupName="test-fleet-" + str(uuid4())[:6], | ||
| Description=sg_description, | ||
| VpcId=vpc_id, | ||
| )["GroupId"] | ||
| cleanups.append(lambda: ec2_client.delete_security_group(GroupId=sg_id)) | ||
|
|
||
| return vpc_id, subnet_id, sg_id | ||
|
|
||
| @staticmethod | ||
| def _cleanup_fleet(ec2_client, fleet_id, instance_ids, subnet_id): | ||
| """Delete the fleet, wait for instance termination, then wait for ENIs | ||
| to be fully released from the subnet before delete_subnet / delete_vpc run. | ||
| """ | ||
| ec2_client.delete_fleets(FleetIds=[fleet_id], TerminateInstances=True) | ||
| if instance_ids: | ||
| ec2_client.get_waiter("instance_terminated").wait(InstanceIds=instance_ids) | ||
| for _ in range(20): | ||
| enis = ec2_client.describe_network_interfaces( | ||
| Filters=[{"Name": "subnet-id", "Values": [subnet_id]}] | ||
| )["NetworkInterfaces"] | ||
| if not enis: | ||
| break | ||
| sleep(3) |
There was a problem hiding this comment.
These static methods feel like they should be fixtures.
| for _ in range(20): | ||
| enis = ec2_client.describe_network_interfaces( | ||
| Filters=[{"Name": "subnet-id", "Values": [subnet_id]}] | ||
| )["NetworkInterfaces"] | ||
| if not enis: | ||
| break | ||
| sleep(3) |
There was a problem hiding this comment.
Please implement the retry fixture as in LocalStack if possible because this pattern is starting to be quite common.
There was a problem hiding this comment.
I agree that we need to implement proper fixtures and improve testing for EC2 in general, but doing it here seems like a wasted effort to me. I think we should discuss about how we test EC2 and maybe move all the tests in localstack repository but this should be a follow-up. Consider that this PR is supposed to be a small bugfix to unblock another initiative.
| "Instance has no security groups — SecurityGroupIds from launch template was not applied" | ||
| ) | ||
| assert sg_id in instance_sg_ids, ( | ||
| f"Expected security group {sg_id!r} to be attached to the instance, " |
There was a problem hiding this comment.
Why !r? Why not just {sg_id}?
There was a problem hiding this comment.
Not specific reason, we can remove the !r
| lambda: self._cleanup_fleet(ec2_client, fleet_id, instance_ids, subnet_id) | ||
| ) | ||
|
|
||
| fleet_errors = fleet_res.get("Errors", []) |
There was a problem hiding this comment.
As per AWS docs, CreateFleet response has no Errors field, but there is an ErrorSet
There was a problem hiding this comment.
huh, boto documents these as Errors 🤦
https://docs.aws.amazon.com/boto3/latest/reference/services/ec2/client/create_fleet.html
| return ssm.get_parameter(Name=kernel_61)["Parameter"]["Value"] | ||
|
|
||
| @staticmethod | ||
| def _create_vpc_subnet_sg(ec2_client, cleanups, sg_description="fleet sg test"): |
There was a problem hiding this comment.
ec2_aws_verified decorator is already capable of creating this, what's the reason it's not used?
There was a problem hiding this comment.
Good question, the reason is that it doesn't work well with cleanups. The cleanup runs after the ec2_aws_verified finally blocks, which ends up with VPCs / subnets not being removed.
nik-localstack
left a comment
There was a problem hiding this comment.
Thanks for the feedback @viren-nadkarni
I agree with most of your comments and I think we need to discuss about ec2 testing in general. Would you be willing to have a call tomorrow or early next week ?
In the meantime, can you specify if these comments are blockers for the current PR ?
| return ssm.get_parameter(Name=kernel_61)["Parameter"]["Value"] | ||
|
|
||
| @staticmethod | ||
| def _create_vpc_subnet_sg(ec2_client, cleanups, sg_description="fleet sg test"): |
There was a problem hiding this comment.
Good question, the reason is that it doesn't work well with cleanups. The cleanup runs after the ec2_aws_verified finally blocks, which ends up with VPCs / subnets not being removed.
| for _ in range(20): | ||
| enis = ec2_client.describe_network_interfaces( | ||
| Filters=[{"Name": "subnet-id", "Values": [subnet_id]}] | ||
| )["NetworkInterfaces"] | ||
| if not enis: | ||
| break | ||
| sleep(3) |
There was a problem hiding this comment.
I agree that we need to implement proper fixtures and improve testing for EC2 in general, but doing it here seems like a wasted effort to me. I think we should discuss about how we test EC2 and maybe move all the tests in localstack repository but this should be a follow-up. Consider that this PR is supposed to be a small bugfix to unblock another initiative.
| for g in (nic.get("Groups") or []) | ||
| ] | ||
| or [] | ||
| ), |
There was a problem hiding this comment.
AWS disallows specifying all these parameters on the creation.
| lambda: self._cleanup_fleet(ec2_client, fleet_id, instance_ids, subnet_id) | ||
| ) | ||
|
|
||
| fleet_errors = fleet_res.get("Errors", []) |
There was a problem hiding this comment.
huh, boto documents these as Errors 🤦
https://docs.aws.amazon.com/boto3/latest/reference/services/ec2/client/create_fleet.html
| "Instance has no security groups — SecurityGroupIds from launch template was not applied" | ||
| ) | ||
| assert sg_id in instance_sg_ids, ( | ||
| f"Expected security group {sg_id!r} to be attached to the instance, " |
There was a problem hiding this comment.
Not specific reason, we can remove the !r
Motivation
When CreateFleet uses a LaunchTemplateSpecification, security groups were never applied to launched instances because group_set was read exclusively from the non-standard GroupSet field, ignoring the actual SecurityGroupIds and NetworkInterfaces[N].Groups fields that AWS clients use.
Solution
The fix falls back through all three sources (GroupSet → SecurityGroupIds → NetworkInterfaces[N].Groups) when building the SpotFleetLaunchSpec, and two AWS-verified regression tests covering both code paths are added in TestFleetSecurityGroups.