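In January of this year I fixed a Nova bug in the OpenStack community. The patch was finally merged recently, so here is a short record of it.
Problem
- Title: nova api returns 500 when creating a volume booted instance with memory enryption enabled
- Link: https://bugs.launchpad.net/nova/+bug/2041511
The bug description and error log are below.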
Description
===========
When creating an instance with an volume created from an image with hw_mem_encryption: true, nova-api returns 500 and the creation request is not accepted.
Steps to reproduce
==================
* Create an image with hw_mem_encryption=True
  openstack image create encrypted ...
  openstack image set encrypted --property hw_mem_encryption=True
* Create a volume from the image
  openstack volume create bootvolume --image encrypted ...
* Create an instance
  openstack server create --volume bootvolume ...
Expected result
===============
Instance creation is accepted and processed by nova, without errors
Actual result
=============
Nova api returns 500 error and does not accept the request
Environment
===========
1. Exact version of OpenStack you are running. See the following
list for all releases: http://docs.openstack.org/releases/
Ubuntu 22.04 and UCA bobcat.
# dpkg -l | grep nova
ii nova-api 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute - API frontend
ii nova-common 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute - common files
ii nova-compute 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute - compute node base
ii nova-compute-kvm 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute - compute node libvirt support
ii nova-conductor 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute - conductor service
ii nova-novncproxy 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute - NoVNC proxy
ii nova-scheduler 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute - virtual machine scheduler
ii python3-nova 3:28.0.0-0ubuntu1~cloud0 all OpenStack Compute Python 3 libraries
ii python3-novaclient 2:18.4.0-0ubuntu1~cloud0 all client library for OpenStack Compute API - 3.x
2. Which hypervisor did you use?
Libvirt + KVM
3. Which storage type did you use?
LVM
4. Which networking type did you use?
ml2 + ovs
Logs & Configs
==============
The following traceback is found in nova-api.log
```
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi [None req-f55255c7-5829-4f89-bee9-ab34a6c02faf 69d6ccfef7e240398970c80f0be8ccf7 5a2803c4cdb1412fa1e83738d7821904 - - default default] Unexpected exception in API method: NotImplementedError: Cannot load 'id' in the base class
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi Traceback (most recent call last):
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/api/openstack/wsgi.py", line 658, in wrapped
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi return f(*args, **kwargs)
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/api/validation/__init__.py", line 110, in wrapper
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/api/validation/__init__.py", line 110, in wrapper
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/api/validation/__init__.py", line 110, in wrapper
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi return func(*args, **kwargs)
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi [Previous line repeated 11 more times]
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/api/openstack/compute/servers.py", line 786, in create
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi instances, resv_id = self.compute_api.create(
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 2207, in create
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi return self._create_instance(
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 1725, in _create_instance
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi self._checks_for_create_and_rebuild(context, image_id, boot_meta,
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 1020, in _checks_for_create_and_rebuild
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi self._validate_flavor_image(context, image_id, image,
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 684, in _validate_flavor_image
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi self._validate_flavor_image_nostatus(
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 830, in _validate_flavor_image_nostatus
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi API._validate_flavor_image_numa_pci(
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/compute/api.py", line 880, in _validate_flavor_image_numa_pci
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi hardware.get_mem_encryption_constraint(flavor, image_meta)
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/nova/virt/hardware.py", line 1195, in get_mem_encryption_constraint
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi image_meta.id)
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/oslo_versionedobjects/base.py", line 67, in getter
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi self.obj_load_attr(name)
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi File "/usr/lib/python3/dist-packages/oslo_versionedobjects/base.py", line 600, in obj_load_attr
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi raise NotImplementedError(
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi NotImplementedError: Cannot load 'id' in the base class
2023-10-27 07:46:56.878 381436 ERROR nova.api.openstack.wsgi
```
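From the error log and the code, the failure is caused by `image_meta` missing the `id` key.
More specifically, when an instance with memory encryption is created, `get_mem_encryption_constraint()` is called to validate the request, and it reads `image_meta.id` while building a log message. When the instance is booted from a volume, `image_meta` has no `id`, so that access raises the `NotImplementedError` seen in the traceback.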
```python
def get_mem_encryption_constraint(
    flavor: 'objects.Flavor',
    image_meta: 'objects.ImageMeta',
    machine_type: ty.Optional[str] = None,
) -> bool:
    ...
    requesters = []
    if flavor_mem_enc:
        requesters.append("hw:mem_encryption extra spec in %s flavor" %
                          flavor.name)
    if image_mem_enc:
        # image_meta.id is only used to build this message, but reading it
        # raises when the field is unset (the boot-from-volume case).
        requesters.append("hw_mem_encryption property of image %s" %
                          image_meta.id)

    _check_mem_encryption_uses_uefi_image(requesters, image_meta)
    _check_mem_encryption_machine_type(image_meta, machine_type)

    LOG.debug("Memory encryption requested by %s", " and ".join(requesters))
    return True
```
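To make the failure mode concrete, here is a minimal sketch of why reading an unset field raises instead of returning `None`. The class `LazyFieldsObject` is a hypothetical, simplified stand-in written for this post. It is not the oslo.versionedobjects implementation; it only imitates the two behaviours that matter here: reading an unset declared field raises `NotImplementedError` (as in the traceback), and `in` reports whether a field has been set.

```python
class LazyFieldsObject:
    """Hypothetical stand-in for an object with declared, lazily set fields."""

    fields = ("id", "name", "status")

    def __init__(self, **values):
        self._values = {}
        for key, value in values.items():
            if key not in self.fields:
                raise AttributeError(key)
            self._values[key] = value

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in type(self).fields:
            values = self.__dict__.get("_values", {})
            if name in values:
                return values[name]
            # Same message shape as the one in the nova-api traceback.
            raise NotImplementedError(
                "Cannot load '%s' in the base class" % name)
        raise AttributeError(name)

    def __contains__(self, name):
        # True only for fields that were actually set.
        return name in self._values


# Booting from an image: the id is set.
from_image = LazyFieldsObject(id="abc", status="active")
# Booting from a volume: the id was never set.
from_volume = LazyFieldsObject(status="active")

print(from_image.id)
print("id" in from_volume)

try:
    from_volume.id
except NotImplementedError as exc:
    print("raised:", exc)
```

The point of the sketch is that the error is raised by the attribute access itself, so any code that formats `image_meta.id` into a message has to check for the field first.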
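As for why `image_meta` has no `id` when booting from a volume: `get_image_metadata_from_volume` pops `image_id` out of the volume's image metadata but never copies it into `image_meta`, as shown here.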
```python
import copy

from oslo_utils import units

VIM_IMAGE_ATTRIBUTES = (
    'image_id', 'image_name', 'size', 'checksum',
    'container_format', 'disk_format', 'min_ram', 'min_disk',
)


def get_image_metadata_from_volume(volume):
    properties = copy.copy(volume.get('volume_image_metadata', {}))
    image_meta = {'properties': properties}
    image_meta['size'] = volume.get('size', 0) * units.Gi
    for attr in VIM_IMAGE_ATTRIBUTES:
        # Every basic attribute is removed from properties, but only
        # min_ram and min_disk are copied back; image_id is dropped.
        val = properties.pop(attr, None)
        if attr in ('min_ram', 'min_disk'):
            image_meta[attr] = int(val or 0)
    image_meta['status'] = 'active'
    return image_meta
```
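The effect is easy to check in isolation. The sketch below reuses the same logic with two assumptions so that it runs without OpenStack installed: `GiB` stands in for `oslo_utils.units.Gi`, and the `volume` dictionary is a made-up record shaped like the one in this bug (the UUID is invented).

```python
import copy

GiB = 1024 ** 3  # assumption: stand-in for oslo_utils.units.Gi

VIM_IMAGE_ATTRIBUTES = (
    'image_id', 'image_name', 'size', 'checksum',
    'container_format', 'disk_format', 'min_ram', 'min_disk',
)


def get_image_metadata_from_volume(volume):
    properties = copy.copy(volume.get('volume_image_metadata', {}))
    image_meta = {'properties': properties}
    image_meta['size'] = volume.get('size', 0) * GiB
    for attr in VIM_IMAGE_ATTRIBUTES:
        val = properties.pop(attr, None)
        if attr in ('min_ram', 'min_disk'):
            image_meta[attr] = int(val or 0)
    image_meta['status'] = 'active'
    return image_meta


# Hypothetical volume created from the "encrypted" image.
volume = {
    'size': 10,
    'volume_image_metadata': {
        'image_id': '11111111-2222-3333-4444-555555555555',
        'image_name': 'encrypted',
        'hw_mem_encryption': 'True',
        'min_ram': '0',
        'min_disk': '1',
    },
}

image_meta = get_image_metadata_from_volume(volume)
print(sorted(image_meta))         # top-level keys: no 'id', no 'name'
print(image_meta['properties'])   # hw_mem_encryption survives
```

So the `hw_mem_encryption` property reaches the validation code, which is why the memory encryption check runs at all, while the image id that the check wants to log has already been discarded.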
Analysis
Note: I replied directly on the community bug tracker at the time, so this part was written in English.
I ran a test and found that creating an instance with memory encryption enabled via `openstack server create --image encrypted` does not cause an error. However, creating one via `openstack server create --volume bootvolume` does.
The reason is that in `_create_instance()` in nova/compute/api.py, when `image_href` is None, `block_device.get_bdm_image_metadata()` is called to get `boot_meta`. `get_bdm_image_metadata()` may in turn call `get_image_metadata_from_volume` to build `image_meta`, which leaves `image_meta` without an `id` key.
Therefore, along the call chain `self._checks_for_create_and_rebuild()` → `self._validate_flavor_image()` → `self._validate_flavor_image_nostatus()` → `API._validate_flavor_image_numa_pci()` → `hardware.get_mem_encryption_constraint()`, accessing `image_meta.id` raises an error.
- The creation failure reproduces reliably
- When booting from an image, image_meta contains id
- When booting from a volume, image_meta does not contain id
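Fix
V1
My first thought was simple: booting from a volume leaves image_meta without an id, so just add the id. Since the name might also be needed, I added both name and id, in the two `if attr == ...` checks at the end of the loop below.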
```python
def get_image_metadata_from_volume(volume):
    properties = copy.copy(volume.get('volume_image_metadata', {}))
    image_meta = {'properties': properties}
    image_meta['size'] = volume.get('size', 0) * units.Gi
    # NOTE(jie): When creating a volume booted instance with memory
    # encryption enabled, image_meta needs id key. See bug 2041511.
    for attr in VIM_IMAGE_ATTRIBUTES:
        val = properties.pop(attr, None)
        if attr in ('min_ram', 'min_disk'):
            image_meta[attr] = int(val or 0)
        if attr == 'image_id' and val:
            image_meta['id'] = val
        if attr == 'image_name' and val:
            image_meta['name'] = val
    image_meta['status'] = 'active'
    return image_meta
```
After I submitted the change, Sean, one of the community reviewers, left some suggestions.
this looks reasonable although we should not actully need the image id for the
get_mem_encryption_constraint fucntion
that should not be trying ot look up the image by id in the context fo a boot form voluem guest it should be useing the volume metadata
we should really just fix
https://github.com/openstack/nova/blob/fed123085d9fc2306833840326c1c6a93deba09d/nova/virt/hardware.py#L1191-L1200
to not depend on the id
its just used for an error message
with that said we changed to id form name in the past since id should always be present
https://github.com/openstack/nova/commit/e98994027f0af0b22277bcfaed4ab6e6f4a2c74e
so the content of this patch is not wrong either
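In other words, we should not depend on `image_meta.id`, because it is only used to print an error message. What really needs fixing is the dependency on `image_meta.id` itself.
So I replied to Sean.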
Hi, Sean, thank you very much for your feedback. I think there are currently three options to consider.
- As you said, stop relying on image.id and use the volume metadata instead, since the id is only used for error messages.
- When using image.id, check whether it exists first, e.g. `image_id = (image_meta.id if 'id' in image_meta else None)`.
- Add the id key when getting image_meta from the volume, as this patch does.
Which do you think is more reasonable, or do you have other suggestions? Thank you.
Sean replied:
we cant assume that the volume will have an image ID filed set as the volume
may not have been created from a glance image in all cases.
So the best approach is likely to add it as you do in the path and if there is no value set it to a sentinel value we can detect later. i.e. '<no-id>' that way if we have a volume we booted form that was created by manually installing an os using an iso or a volume created from a volume snapshot then we can still log '<no-id>' in the error if there is a conflict and we will log the correct id if not. the other approach would be to detect if its boot form volume and instead of setting "requestion is image ..."
pass "requester is volume volume uuid" or similar where this is currently being used and causing the atirbute error.
in a boot form volume case its more useful to have the volume uuid then the image uuid as the volume metadta can be updated after it was intially created form the image.
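That is, a volume will not necessarily carry an image id in every case, so it is best to remove the direct dependency on the image id. One workable solution is to check whether the image id exists before using it and, if it does not, fall back to a specific sentinel value such as `<no-id>`, which is easy to search for later.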
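A small sketch shows why copying the id alone (the V1 idea) is not enough and what the sentinel buys. Both helpers, `add_image_identity` and `image_id_for_message`, are hypothetical names invented for this illustration, and plain dictionaries stand in for Nova's objects.

```python
def add_image_identity(image_meta, volume_image_metadata):
    """V1-style step: copy image_id/image_name over when they are present."""
    for src, dst in (('image_id', 'id'), ('image_name', 'name')):
        val = volume_image_metadata.get(src)
        if val:
            image_meta[dst] = val
    return image_meta


def image_id_for_message(image_meta):
    """Sentinel fallback so that building a message can never fail."""
    return image_meta['id'] if 'id' in image_meta else '<no-id>'


# Volume created from a Glance image: the id can be copied over.
from_glance_image = add_image_identity(
    {}, {'image_id': 'abc', 'image_name': 'encrypted'})

# Volume installed from an ISO or restored from a snapshot: there is
# no image_id to copy, so V1 alone still leaves 'id' unset.
from_iso_install = add_image_identity({}, {})

print(image_id_for_message(from_glance_image))
print(image_id_for_message(from_iso_install))
```

With the fallback in place, the message code works whether or not the id could be recovered from the volume, which is the case Sean pointed out.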
V2
I revised the patch according to Sean's comments, as follows.
```python
def get_mem_encryption_constraint(flavor, image_meta, machine_type=None):
    ...
    # NOTE(jie, sean): When creating a volume booted instance with memory
    # encryption enabled, image_meta has no id key. See bug #2041511.
    # So we check whether id exists. If there is no value, we set it
    # to a sentinel value we can detect later. i.e. '<no-id>'.
    ...
    if image_mem_enc:
        image_id = (image_meta.id if 'id' in image_meta else '<no-id>')
        requesters.append("hw_mem_encryption property of image %s" %
                          image_id)


def _check_mem_encryption_machine_type(image_meta, machine_type=None):
    ...
    # image_meta.id is not set when booting from volume.
    image_id = (image_meta.id if 'id' in image_meta else '<no-id>')
    ...
```
After I resubmitted, Sean responded and asked me to add unit tests.
so this will mitigate the current problem
at some point we shoudl refactor this code to eihter pass a volume or instance object so that we can detect if its boot form volume and log a diffent message
with the volume id when we fail for boot form volume instnaces.that is out of scope of thei change but can you add unit test coverage for this?
i.e. assert when we invoke this with a bfv instance the message contains <no-id>
After I added the unit tests, Sean finally gave the change a +2.
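The kind of test requested can be sketched as follows. This is a stand-alone approximation and not the test that was merged into Nova: `FakeImageMeta` and `describe_image_requester` are hypothetical helpers that imitate an image meta object with an unset `id` and the patched message construction.

```python
import unittest


class FakeImageMeta:
    """Hypothetical stand-in for an image meta object whose id may be unset."""

    def __init__(self, **fields):
        self._fields = fields

    def __contains__(self, name):
        return name in self._fields

    def __getattr__(self, name):
        if name.startswith('_'):
            raise AttributeError(name)
        fields = self.__dict__.get('_fields', {})
        if name in fields:
            return fields[name]
        raise NotImplementedError(
            "Cannot load '%s' in the base class" % name)


def describe_image_requester(image_meta):
    """Message construction with the sentinel fallback from V2."""
    image_id = image_meta.id if 'id' in image_meta else '<no-id>'
    return "hw_mem_encryption property of image %s" % image_id


class MemEncryptionRequesterTest(unittest.TestCase):

    def test_boot_from_volume_uses_sentinel(self):
        # No id set, as when image_meta is derived from a volume.
        msg = describe_image_requester(FakeImageMeta())
        self.assertIn('<no-id>', msg)

    def test_boot_from_image_uses_real_id(self):
        msg = describe_image_requester(FakeImageMeta(id='abc'))
        self.assertIn('abc', msg)
        self.assertNotIn('<no-id>', msg)
```

Saved as a module, the two cases can be run with `python -m unittest`. The first case is the regression check for this bug: before the fix, building the message raised instead of returning a string.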
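Exploration
Browsing the community bug tracker, I found that quite a few bugs were caused by image meta, four in total. Here they are in chronological order.
1. 2021-05-11, SEV enabled instance unable to hard reboot: `image_meta` was missing the `name` key, which caused an error, so the code was switched to the `id` key (nobody expected that `id` could be missing later as well...).
2. 2023-02-17, Validation of memory encryption constraints fails as img properties are not present: image_meta was missing the relevant properties, which caused an error.
3. 2023-10-27, nova api returns 500 when creating a volume booted instance with memory enryption enabled: the issue handled in this post; image_meta was missing id, which caused an error.
4. 2023-12-26, nova api returns 500 when resizing an instance with memory encryption enabled: the same problem; image_meta was missing the id, which caused an error.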
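Summary
1. Reproducing the bug itself was not hard. Setting up a working OpenStack environment was the troublesome part :)
2. The community reviewers are very friendly, the communication was pleasant, and the feedback was on point. It is well worth talking with them more.
3. Look at a problem more broadly and deeply. While fixing this bug my only thought was: image_meta lacks an id when booting from a volume, so add the id. It did not occur to me that the volume's properties might not contain an image id at all.
4. Thanks to Sean Mooney.
5. The code was submitted in January and only merged in July. It was not easy, but I am very happy and feel I have made a small contribution to the community.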