Generating a nested list from an XML document

小能豆

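Using Python, my goal is to parse an XML document I created and build a nested list from it, so that I can access the entries later and parse the feeds. The XML document looks like the following snippet:

<?xml version="1.0"?>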
<sources>
    <!--Source List by Institution-->
    <sourceList source="cbc">
        <f>http://rss.cbc.ca/lineup/topstories.xml</f>
    </sourceList>
    <sourceList source="bbc">
        <f>http://feeds.bbci.co.uk/news/rss.xml</f>
        <f>http://feeds.bbci.co.uk/news/world/rss.xml</f>
        <f>http://feeds.bbci.co.uk/news/uk/rss.xml</f>
    </sourceList>
    <sourceList source="reuters">
        <f>http://feeds.reuters.com/reuters/topNews</f>
        <f>http://feeds.reuters.com/news/artsculture</f>
    </sourceList>
</sources>

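I would like something like a nested list, where the innermost lists hold the contents between the <f></f> tags and the level above is named after the source, e.g. source="reuters" for Reuters. Retrieving the information from the XML document is not the problem; I am doing that with ElementTree, looping and calling node.get('source') and so on. The problem is that I can't generate lists that carry the required names and have different lengths for the different sources. I have tried append, but I'm not sure how to attach the retrieved name to the list. Would a dictionary be better? What is the best practice in this situation, and how would I implement it? If more information is needed, please comment and I will add it.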

2024-12-12

1 answer

小能豆

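To parse the XML and store the data in a structured form (a nested list or a dictionary), a dictionary is the more flexible and accessible option, because the feeds can be looked up directly by source name. Here is one way to do it with Python's xml.etree.ElementTree module.

Implementation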

import xml.etree.ElementTree as ET

# Sample XML data
xml_data = """
<sources>
    <!--Source List by Institution-->
    <sourceList source="cbc">
        <f>http://rss.cbc.ca/lineup/topstories.xml</f>
    </sourceList>
    <sourceList source="bbc">
        <f>http://feeds.bbci.co.uk/news/rss.xml</f>
        <f>http://feeds.bbci.co.uk/news/world/rss.xml</f>
        <f>http://feeds.bbci.co.uk/news/uk/rss.xml</f>
    </sourceList>
    <sourceList source="reuters">
        <f>http://feeds.reuters.com/reuters/topNews</f>
        <f>http://feeds.reuters.com/news/artsculture</f>
    </sourceList>
</sources>
"""

# Parse the XML data
root = ET.fromstring(xml_data)

# Dictionary that will hold the parsed result
sources_dict = {}

# Walk over the sourceList nodes
for source_list in root.findall("sourceList"):
    source_name = source_list.get("source")  # value of the source attribute
    feed_urls = [f.text for f in source_list.findall("f")]  # text of every <f> element
    sources_dict[source_name] = feed_urls  # store the list under the source name

# Print the result
print(sources_dict)

Output (laid out over several lines here for readability; print writes the dictionary on a single line)

{
    'cbc': ['http://rss.cbc.ca/lineup/topstories.xml'],
    'bbc': [
        'http://feeds.bbci.co.uk/news/rss.xml',
        'http://feeds.bbci.co.uk/news/world/rss.xml',
        'http://feeds.bbci.co.uk/news/uk/rss.xml'
    ],
    'reuters': [
        'http://feeds.reuters.com/reuters/topNews',
        'http://feeds.reuters.com/news/artsculture'
    ]
}

Accessing the data

Look up by source:

print(sources_dict['bbc'])
# Output: ['http://feeds.bbci.co.uk/news/rss.xml', 'http://feeds.bbci.co.uk/news/world/rss.xml', 'http://feeds.bbci.co.uk/news/uk/rss.xml']

Iterate over all sources and their feeds:

for source, feeds in sources_dict.items():
    print(f"Source: {source}")
    for feed in feeds:
        print(f"  Feed: {feed}")
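
Two related access patterns may be handy once the dictionary exists: looking up a source that might not be present, and flattening everything into (source, url) pairs so each URL can later be handed to a feed parser. The sketch below is only an illustration; the literal dictionary stands in for the parsed result, and the key "nytimes" is a made-up example of a missing source.

```python
# Stand-in for the dictionary produced by the parsing code.
sources_dict = {
    "cbc": ["http://rss.cbc.ca/lineup/topstories.xml"],
    "bbc": [
        "http://feeds.bbci.co.uk/news/rss.xml",
        "http://feeds.bbci.co.uk/news/world/rss.xml",
        "http://feeds.bbci.co.uk/news/uk/rss.xml",
    ],
}

# dict.get returns the fallback instead of raising KeyError
# when the source is unknown ("nytimes" is a hypothetical key).
missing = sources_dict.get("nytimes", [])

# Flatten into (source, url) pairs, one pair per feed.
pairs = [
    (source, url)
    for source, feeds in sources_dict.items()
    for url in feeds
]

print(missing)
print(len(pairs))
```

Indexing with sources_dict['nytimes'] would raise KeyError here, so get with a default is the gentler choice when the source name comes from user input.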

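Dictionary vs. nested list

Dictionary:
Advantages:
- Direct access by source name.
- A clear structure that suits key/value data.
Use when: you need fast lookups and organized access.

Nested list:
Example: [['cbc', ['url1']], ['bbc', ['url2', 'url3']]]
Disadvantage: finding a particular source requires scanning the list.
Use when: the case is simple and lookups are infrequent.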
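
If a nested list is still what you want, it can be built with append by adding one [name, urls] pair per source. The sketch below assumes a shortened version of the sample XML; the helper find_feeds is a hypothetical name introduced here to show the linear scan that a nested list requires.

```python
import xml.etree.ElementTree as ET

# Shortened sample data, assumed for this sketch.
xml_data = """
<sources>
    <sourceList source="cbc">
        <f>http://rss.cbc.ca/lineup/topstories.xml</f>
    </sourceList>
    <sourceList source="reuters">
        <f>http://feeds.reuters.com/reuters/topNews</f>
        <f>http://feeds.reuters.com/news/artsculture</f>
    </sourceList>
</sources>
"""

root = ET.fromstring(xml_data)

# Build [[name, [url, ...]], ...] by appending one pair per source.
nested = []
for source_list in root.findall("sourceList"):
    urls = [f.text for f in source_list.findall("f")]
    nested.append([source_list.get("source"), urls])


def find_feeds(nested_list, name):
    """Linear scan: return the URL list for `name`, or None if absent."""
    for source_name, urls in nested_list:
        if source_name == name:
            return urls
    return None


reuters_feeds = find_feeds(nested, "reuters")

# Each inner [name, urls] pair becomes one key/value entry.
as_dict = dict(nested)

print(nested)
print(reuters_feeds)
```

Because every inner item is a two-element pair, dict(nested) converts the nested list into the dictionary form at any time, so the two representations are easy to switch between.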

Extensions

Saving the data to a JSON file:
```python
import json

with open("feeds.json", "w", encoding="utf-8") as f:
    json.dump(sources_dict, f, indent=4)
```

Loading from an XML file:

tree = ET.parse("feeds.xml")
root = tree.getroot()

Data validation:
While parsing, check that the source attribute and the <f> elements are present, and skip invalid nodes.
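
The validation step could look roughly like the sketch below. It is one possible reading of "skip invalid nodes": a sourceList without a source attribute is ignored, empty or whitespace-only <f> elements are dropped, and a source left with no usable URLs is skipped. The malformed sample entries, including the example.com URL, are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical input containing deliberately bad entries.
xml_data = """
<sources>
    <sourceList source="cbc">
        <f>http://rss.cbc.ca/lineup/topstories.xml</f>
        <f>   </f>
        <f/>
    </sourceList>
    <sourceList>
        <f>http://example.com/no-source-attribute.xml</f>
    </sourceList>
    <sourceList source="empty"/>
</sources>
"""

root = ET.fromstring(xml_data)
sources_dict = {}

for source_list in root.findall("sourceList"):
    name = source_list.get("source")
    if not name:
        continue  # missing or empty source attribute

    # Keep only <f> elements that contain non-blank text.
    urls = [
        f.text.strip()
        for f in source_list.findall("f")
        if f.text and f.text.strip()
    ]
    if not urls:
        continue  # nothing usable under this source

    sources_dict[name] = urls

print(sources_dict)
```

Whether to skip silently, log a warning, or raise an error is a design choice that depends on how much you trust the input file.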

The code is clearly structured and easy to extend, which makes it a reasonable starting point for parsing and managing XML data.

2024-12-12