ChatGPT didn't write my articles for me, but it turned out to be surprisingly good at writing my code

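Coding isn't my strong suit, but for the small needs that come up in my daily work I no longer have to learn programming from scratch. With a Python environment and a few libraries installed, I can describe what I need to ChatGPT4 in plain language and debug the code by pasting the system's error messages back into the chat.
Below is one successful example. If you're curious, have a look at how I debugged code with my "mouth".
What the code does: I put a batch of URLs into a txt file, and the script opens each page in turn and saves the body text to a txt file. So that I can trace the page later, the URL is written on the first line of the txt file, with the title following after a blank line.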
# ChatGPT4 conversation log

> Exported: 2023-05-27 16:22:02 +08:00

## How to batch-download URLs to local files, keeping only the body text

> You are ChatGPT, a large language model trained by OpenAI. Follow the user's instructions carefully. Respond using markdown.

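**You:**
How can I batch-download URLs to local files and keep only the body text?

**ChatGPT4:**
To batch-download URLs to local files and keep only the body text, you can write a Python script. This example uses the `requests` and `beautifulsoup4` libraries. Install them first: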

```
pip install requests
pip install beautifulsoup4
```

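Then follow these steps:

1. Store the URLs you want to download in a text file (e.g. `input_urls.txt`), one link per line.
2. Write a Python script that reads this text file, then downloads each page and extracts the body text.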

```python
import requests
from bs4 import BeautifulSoup

# Read all URLs from input_urls.txt
with open('input_urls.txt', 'r') as f:
    urls = f.readlines()

# Loop over the URLs, download each page and extract the body text
for i, url in enumerate(urls):
    url = url.strip()

    try:
        # Request the URL (the timeout keeps a stalled server from hanging the script)
        response = requests.get(url, timeout=10)

        # Check whether the request succeeded
        if response.status_code != 200:
            print(f"Request failed: {url}")
            continue

        # Parse with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Save the body text to a local file
        with open(f'output_{i}.txt', 'w', encoding='utf-8') as outfile:
            outfile.write(main_content.get_text())

        print(f"Saved: {url}")

    except Exception as e:
        print(f"Processing {url} raised an exception: {e}")
```

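When you run this script, the URLs are visited and downloaded one by one. The body content (the `body` part of the HTML) is extracted and saved locally as plain text (`output_x.txt`). Note: the right way to extract the body text depends on the actual page, so you may need to adjust it for different sites.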
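
As a rough illustration of the kind of adjustment mentioned above, here is a minimal sketch that prefers a more specific container over the whole `<body>`. The tag names in `CANDIDATE_TAGS` and the helper name `find_main_content` are assumptions for illustration only; real sites often need a site-specific rule.

```python
from bs4 import BeautifulSoup

# Hypothetical priority list; adjust it to the pages you are downloading.
CANDIDATE_TAGS = ("article", "main")


def find_main_content(html):
    """Return the first matching content container, or <body> as a fallback."""
    soup = BeautifulSoup(html, "html.parser")
    for tag_name in CANDIDATE_TAGS:
        node = soup.find(tag_name)
        if node is not None:
            return node
    return soup.body


sample = "<html><body><nav>Menu</nav><article><p>Hello</p></article></body></html>"
print(find_main_content(sample).get_text())  # the navigation text is left out
```

With this in place, the `soup.find('body')` line in the script could be swapped for a call to the helper, at the cost of one more thing to tune per site.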

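**You:**
System message:
FileNotFoundError: [Errno 2] No such file or directory: 'input_urls.txt'

**ChatGPT4:**
This error means the Python script could not find a file named `input_urls.txt` in the current working directory. Please check the following:

1. Create a file named `input_urls.txt` (mind the letter case and the file extension).
2. Make sure `input_urls.txt` is saved in the same directory as your Python script.
3. Make sure you run the Python script from that directory.

Here is an example. Create a file named `input_urls.txt` and add a few links to it: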

```
https://www.example1.com
https://www.example2.com
https://www.example3.com
```

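Then make sure the Python script is run from the same directory as this file. If the error persists, check that the directory path and the file name are typed correctly.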
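
A small sketch of reading such a file a little more defensively, assuming one URL per line as in the example above. The helper name `load_urls` is hypothetical. It skips blank lines, and the `utf-8-sig` codec tolerates the byte-order mark that some Windows editors add at the start of a file.

```python
def load_urls(path):
    """Read one URL per line, dropping blank lines and surrounding whitespace."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return [line.strip() for line in f if line.strip()]
```

In the script this could stand in for the `f.readlines()` step, for example `urls = load_urls('input_urls.txt')`.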

If you want to specify the file's full path, you can change the part of the script that reads the file like this:

```python
import os

# Build the full path to the file
# (abspath resolves a relative name against the current working directory)
file_path = os.path.abspath('input_urls.txt')

# Read all URLs from input_urls.txt
with open(file_path, 'r') as f:
    urls = f.readlines()
```

Be sure to replace the `input_urls.txt` path with the correct path. Then run the script.
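
One caveat: `os.path.abspath('input_urls.txt')` resolves the name against the current working directory, so on its own it opens the same file as before. Below is a minimal sketch of an alternative, assuming the URL list sits next to the script. The helper name `path_beside_script` is hypothetical.

```python
from pathlib import Path


def path_beside_script(script_file, name):
    """Return the path of `name` inside the directory that holds `script_file`."""
    return Path(script_file).resolve().parent / name
```

Inside the script it would be called as `path_beside_script(__file__, 'input_urls.txt')`, which finds the file no matter which directory the script is launched from.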

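**You:**
The problem is solved. Please help me improve the code further. I want the URL written at the top of each txt file, and the file name should follow the format "page title.txt".

**ChatGPT4:**
To meet your requirements, we need to modify the script so that each saved file includes the URL and uses the page title as its file name. Here is the modified script: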

```python
import requests
from bs4 import BeautifulSoup

# Read all URLs from input_urls.txt
with open('input_urls.txt', 'r') as f:
    urls = f.readlines()

# Loop over the URLs, download each page and extract the body text
for url in urls:
    url = url.strip()

    try:
        # Request the URL (the timeout keeps a stalled server from hanging the script)
        response = requests.get(url, timeout=10)

        # Check whether the request succeeded
        if response.status_code != 200:
            print(f"Request failed: {url}")
            continue

        # Parse with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title (fall back when <title> is missing or empty)
        title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled"

        # Keep only characters that are safe in a file name
        valid_filename_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
        title = ''.join(c for c in title if c in valid_filename_chars)

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Save the URL and the body text to a local file
        with open(f'{title}.txt', 'w', encoding='utf-8') as outfile:
            outfile.write(f"Source URL: {url}\n\n")
            outfile.write(main_content.get_text())

        print(f"Saved: {url}")

    except Exception as e:
        print(f"Processing {url} raised an exception: {e}")
```

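As requested, the script now writes the URL at the top of each output file. It also extracts the page title and uses it as the output file name. Note that we also strip characters that are illegal in file names from the title, to avoid problems when saving the file.

Run the script again and you will find that the output files are named in the "page title.txt" format, with the original URL placed before the body text. If you run into any problems, feel free to let me know.

**You:**
System message:
raised an exception: name 'string' is not defined

**ChatGPT4:**
My apologies, I forgot to import the `string` module in my previous answer. Please add `import string` at the top of the script. Here is the complete fixed script: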

```python
import requests
from bs4 import BeautifulSoup
import string

# Read all URLs from input_urls.txt
with open('input_urls.txt', 'r') as f:
    urls = f.readlines()

# Loop over the URLs, download each page and extract the body text
for url in urls:
    url = url.strip()

    try:
        # Request the URL (the timeout keeps a stalled server from hanging the script)
        response = requests.get(url, timeout=10)

        # Check whether the request succeeded
        if response.status_code != 200:
            print(f"Request failed: {url}")
            continue

        # Parse with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title (fall back when <title> is missing or empty)
        title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled"

        # Keep only characters that are safe in a file name
        valid_filename_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
        title = ''.join(c for c in title if c in valid_filename_chars)

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Save the URL and the body text to a local file
        with open(f'{title}.txt', 'w', encoding='utf-8') as outfile:
            outfile.write(f"Source URL: {url}\n\n")
            outfile.write(main_content.get_text())

        print(f"Saved: {url}")

    except Exception as e:
        print(f"Processing {url} raised an exception: {e}")
```

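The modified script should now run without this problem. If you run into other issues, feel free to let me know.

**You:**
It runs successfully now, but the content of the txt files has no paragraph breaks. Can the code be improved to keep the paragraph breaks of the original page text?

**ChatGPT4:**
To keep the paragraph breaks of the original page text, we can change the part of the code that extracts the body content. Note that different pages may use different tags for the body text and its paragraphs. In this example we assume paragraphs are marked with `<p>` tags. You may need to adjust this for the specific pages you download.

Here is the modified script: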

```python
import requests
from bs4 import BeautifulSoup
import string

# Read all URLs from input_urls.txt
with open('input_urls.txt', 'r') as f:
    urls = f.readlines()

# Loop over the URLs, download each page and extract the body text
for url in urls:
    url = url.strip()

    try:
        # Request the URL (the timeout keeps a stalled server from hanging the script)
        response = requests.get(url, timeout=10)

        # Check whether the request succeeded
        if response.status_code != 200:
            print(f"Request failed: {url}")
            continue

        # Parse with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title (fall back when <title> is missing or empty)
        title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled"

        # Keep only characters that are safe in a file name
        valid_filename_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
        title = ''.join(c for c in title if c in valid_filename_chars)

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Extract the paragraphs of the body and save them to a local file
        with open(f'{title}.txt', 'w', encoding='utf-8') as outfile:
            outfile.write(f"Source URL: {url}\n\n")
            paragraphs = main_content.find_all('p')
            for paragraph in paragraphs:
                outfile.write(paragraph.get_text() + "\n\n")

        print(f"Saved: {url}")

    except Exception as e:
        print(f"Processing {url} raised an exception: {e}")
```

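In this modified script, we find all the `<p>` tags under the body content and write each paragraph to the output file, leaving a blank line between paragraphs. That way the output txt file preserves the paragraph breaks of the original text. If you need further help, feel free to let me know.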
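
The paragraph step can be tried in isolation on a small HTML string. This is a minimal sketch under the same `<p>` assumption. The helper name `extract_paragraphs` is hypothetical, and skipping empty paragraphs is an extra that the script above does not do.

```python
from bs4 import BeautifulSoup


def extract_paragraphs(html):
    """Return the text of each non-empty <p> element, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.body if soup.body is not None else soup
    texts = []
    for paragraph in container.find_all("p"):
        # Strip the whole string rather than each fragment, so inline tags
        # such as <b> do not cause neighbouring words to run together.
        text = paragraph.get_text().strip()
        if text:
            texts.append(text)
    return texts


sample = "<body><p> First </p><p></p><p>Second <b>bold</b></p></body>"
print("\n\n".join(extract_paragraphs(sample)))
```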

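**You:**
It runs fine. Please help me improve the code: also write the page title into the txt file, on a new line after the URL and before the body text.

**ChatGPT4:**
As requested, we can write the title to the output file before saving the body content. Here is the modified script: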

```python
import requests
from bs4 import BeautifulSoup
import string

# Read all URLs from input_urls.txt
with open('input_urls.txt', 'r') as f:
    urls = f.readlines()

# Loop over the URLs, download each page and extract the body text
for url in urls:
    url = url.strip()

    try:
        # Request the URL (the timeout keeps a stalled server from hanging the script)
        response = requests.get(url, timeout=10)

        # Check whether the request succeeded
        if response.status_code != 200:
            print(f"Request failed: {url}")
            continue

        # Parse with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title (fall back when <title> is missing or empty)
        title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled"

        # Keep only characters that are safe in a file name
        valid_filename_chars = "-_.() %s%s" % (string.ascii_letters, string.digits)
        title = ''.join(c for c in title if c in valid_filename_chars)

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Extract the paragraphs of the body and save them to a local file
        with open(f'{title}.txt', 'w', encoding='utf-8') as outfile:
            outfile.write(f"Source URL: {url}\n\n")
            outfile.write(f"Title: {title}\n\n")  # write the title to the output file
            paragraphs = main_content.find_all('p')
            for paragraph in paragraphs:
                outfile.write(paragraph.get_text() + "\n\n")

        print(f"Saved: {url}")

    except Exception as e:
        print(f"Processing {url} raised an exception: {e}")
```

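Now the title is written on its own line in the output file, between the URL and the body text. After running this script, each output file will contain the source URL, the title, and the body text. If you need further help, feel free to let me know.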
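
The resulting file layout can also be described as a small pure function, which makes it easy to check without touching the network or the disk. This is only a sketch. The helper name `build_output` is hypothetical and the two label strings are placeholders.

```python
def build_output(url, title, paragraphs):
    """Lay out one output file: URL, then title, then one block per paragraph."""
    parts = [f"Source URL: {url}", f"Title: {title}"]
    parts.extend(paragraphs)
    # Every part is followed by a blank line, matching the "\n\n" writes above.
    return "".join(part + "\n\n" for part in parts)
```

The script could then write the whole file in one call, for example `outfile.write(build_output(url, title, texts))`, where `texts` is the list of paragraph strings.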

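**You:**
It runs, but I found a problem: Chinese titles are not recognized correctly. Can the code be improved to support Chinese?

**ChatGPT4:**
To make the script handle Chinese titles correctly, we need to change the part that strips illegal characters from the page title. Here is the modified script: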

```python
import requests
from bs4 import BeautifulSoup
import re

# Read all URLs from input_urls.txt
with open('input_urls.txt', 'r') as f:
    urls = f.readlines()

# Loop over the URLs, download each page and extract the body text
for url in urls:
    url = url.strip()

    try:
        # Request the URL (the timeout keeps a stalled server from hanging the script)
        response = requests.get(url, timeout=10)

        # Check whether the request succeeded
        if response.status_code != 200:
            print(f"Request failed: {url}")
            continue

        # Parse with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title (fall back when <title> is missing or empty)
        title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled"

        # Strip illegal characters from the page title
        title = re.sub(r'[^\w\s-_\u4e00-\u9fa5]', '', title)  # supports Chinese

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Extract the paragraphs of the body and save them to a local file
        with open(f'{title}.txt', 'w', encoding='utf-8') as outfile:
            outfile.write(f"Source URL: {url}\n\n")
            outfile.write(f"Title: {title}\n\n")  # write the title to the output file
            paragraphs = main_content.find_all('p')
            for paragraph in paragraphs:
                outfile.write(paragraph.get_text() + "\n\n")

        print(f"Saved: {url}")

    except Exception as e:
        print(f"Processing {url} raised an exception: {e}")
```

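In the modified script we use the regular-expression function `re.sub()` to strip illegal characters from the title while allowing Chinese characters. The script should now handle titles that contain Chinese correctly.

The output files will carry correctly recognized Chinese titles. If you need further help, feel free to let me know.

**You:**
System message:
raised an exception: bad character range \s-_ at position 4

**ChatGPT4:**
Sorry for the trouble. The exception is caused by the hyphen in the regular expression not being handled correctly: inside the character class, the unescaped `-` between `\s` and `_` is read as a range. Please use the corrected regular expression below: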

```python
title = re.sub(r'[^\w\s\-_\u4e00-\u9fa5]', '', title)  # supports Chinese
```

Here is the complete corrected script:

```python
import requests
from bs4 import BeautifulSoup
import re

# Read all URLs from input_urls.txt
with open('input_urls.txt', 'r') as f:
    urls = f.readlines()

# Loop over the URLs, download each page and extract the body text
for url in urls:
    url = url.strip()

    try:
        # Request the URL (the timeout keeps a stalled server from hanging the script)
        response = requests.get(url, timeout=10)

        # Check whether the request succeeded
        if response.status_code != 200:
            print(f"Request failed: {url}")
            continue

        # Parse with BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get the page title (fall back when <title> is missing or empty)
        title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled"

        # Strip illegal characters from the page title
        title = re.sub(r'[^\w\s\-_\u4e00-\u9fa5]', '', title)  # supports Chinese

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Extract the paragraphs of the body and save them to a local file
        with open(f'{title}.txt', 'w', encoding='utf-8') as outfile:
            outfile.write(f"Source URL: {url}\n\n")
            outfile.write(f"Title: {title}\n\n")  # write the title to the output file
            paragraphs = main_content.find_all('p')
            for paragraph in paragraphs:
                outfile.write(paragraph.get_text() + "\n\n")

        print(f"Saved: {url}")

    except Exception as e:
        print(f"Processing {url} raised an exception: {e}")
```

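Running this modified script should now handle titles containing Chinese without raising the exception. If you have any problems, feel free to let me know.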
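
The title clean-up can be pulled out into a small function and tried on its own. This is a minimal sketch, not the script's exact rule. It relies on the fact that in Python 3 `\w` already matches Chinese characters for `str` patterns, and it puts the hyphen last in the character class so it needs no escaping. The helper name `sanitize_title` and the whitespace collapsing are additions for illustration.

```python
import re


def sanitize_title(title, fallback="Untitled"):
    """Drop characters that are unsafe in a file name, keeping letters
    (including Chinese), digits, underscores, spaces and hyphens."""
    cleaned = re.sub(r"[^\w\s-]", "", title)
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    # An all-punctuation title would otherwise produce the file name ".txt".
    return cleaned or fallback


print(sanitize_title("如何: 学习/Python?"))
print(sanitize_title("???"))
```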

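**You:**
It now runs perfectly. I have another request: based on this code, change it to walk through the directory the script is in and convert all the html files to txt files.

**ChatGPT4:**
To adapt this script so that it scans the current directory and converts all html files to txt files, you can follow these steps:

1. Use the `os` library to list all files in the current directory.
2. Keep the files whose extension is `.html`.
3. For each html file, run a similar content-extraction step and save the output as a txt file.

Here is the adapted script: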

```python
from bs4 import BeautifulSoup
import os
import re

# List all files in the current directory
files_in_directory = os.listdir()
html_files = [file for file in files_in_directory if file.endswith(".html")]

# Loop over all html files
for html_file in html_files:
    try:
        # Read the file inside the try block so a decoding error
        # skips this file instead of stopping the whole run
        with open(html_file, 'r', encoding='utf-8') as f:
            content = f.read()

        # Parse with BeautifulSoup
        soup = BeautifulSoup(content, 'html.parser')

        # Get the page title (fall back when <title> is missing or empty)
        title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled"

        # Strip illegal characters from the page title
        title = re.sub(r'[^\w\s\-_\u4e00-\u9fa5]', '', title)  # supports Chinese

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Extract the paragraphs of the body and save them to a local file
        with open(f'{title}.txt', 'w', encoding='utf-8') as outfile:
            paragraphs = main_content.find_all('p')
            for paragraph in paragraphs:
                outfile.write(paragraph.get_text() + "\n\n")

        print(f"Saved: {html_file}")

    except Exception as e:
        print(f"Processing {html_file} raised an exception: {e}")
```

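Now, when you run this script in a directory containing html files, it will walk through the directory, extract the body text of each html file, and save it as a .txt file.

Note: this script assumes all html files are UTF-8 encoded. If necessary, adjust the file encoding to match your files.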
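
If some of the saved pages are not UTF-8, one possible adjustment is to try a short list of encodings in turn. This is a minimal sketch. The helper name `read_text_with_fallback` is hypothetical, and the choice of `gb18030` as the second candidate is an assumption aimed at Chinese pages. A wrong guess can still decode without error and produce garbled text, so treat it as a convenience rather than real detection.

```python
def read_text_with_fallback(path, encodings=("utf-8", "gb18030")):
    """Decode a file with the first encoding that works, trying them in order."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace")
```

Another option is to open the file in binary mode and pass the bytes straight to `BeautifulSoup`, which then tries to work out the encoding itself.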

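If you need further help, feel free to let me know.

**You:**
The code runs perfectly. One thing to refine: please write the page title on the first line of the txt file, with the body text starting on a new line.

**ChatGPT4:**
As requested, here is the modified script, which writes the page title on the first line of the txt file and starts the body text on a new line: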

```python
from bs4 import BeautifulSoup
import os
import re

# List all files in the current directory
files_in_directory = os.listdir()
html_files = [file for file in files_in_directory if file.endswith(".html")]

# Loop over all html files
for html_file in html_files:
    try:
        # Read the file inside the try block so a decoding error
        # skips this file instead of stopping the whole run
        with open(html_file, 'r', encoding='utf-8') as f:
            content = f.read()

        # Parse with BeautifulSoup
        soup = BeautifulSoup(content, 'html.parser')

        # Get the page title (fall back when <title> is missing or empty)
        title = soup.title.string.strip() if soup.title and soup.title.string else "Untitled"

        # Strip illegal characters from the page title
        title = re.sub(r'[^\w\s\-_\u4e00-\u9fa5]', '', title)  # supports Chinese

        # Get the body content (may need adjusting for the specific page)
        main_content = soup.find('body')

        # Extract the paragraphs of the body and save them to a local file
        with open(f'{title}.txt', 'w', encoding='utf-8') as outfile:
            outfile.write(f"{title}\n\n")  # write the title on the first line
            paragraphs = main_content.find_all('p')
            for paragraph in paragraphs:
                outfile.write(paragraph.get_text() + "\n\n")

        print(f"Saved: {html_file}")

    except Exception as e:
        print(f"Processing {html_file} raised an exception: {e}")
```

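The script now writes the page title directly on the first line of each output txt file. The body text starts on a new line after the title. You can customize other details as needed. If you need further help, feel free to let me know.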