3.20 Add deduplication logic: each title must be unique within a category

2025-03-20 10:12:42 +08:00 · 2025-03-20 10:12:42 +08:00 · f940444428
commit f940444428
parent 4c85fcf160
5 changed files with 404 additions and 41 deletions
--- a/2
+++ b/2
@@ -15,4 +15,4 @@ COPY . .
 ENV PYTHONPATH="/markdown_operation"

 # Set the command to run when the container starts, e.g. running main.py
-CMD ["/bin/bash"]
+#CMD ["/bin/bash"]
--- a/README.md
+++ b/README.md
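@@ -1,6 +1,269 @@
-Publish Markdown files from a local folder to a Typecho site
+## Sync local Markdown to a Typecho site
+
+Scenario: I like writing Markdown files locally in Typora, but I also want them synced to Typecho and published as posts. Because the files keep changing and often need small fixes, a local update has to propagate to the blog.
+
+Tested with: Typecho 1.2, PHP 7.4.33
+
+### Project layout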
+
+![image-20250319173057792](D:\folder\study\md_files\output\image-20250319173057792.png)
+
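+### **Core approach**
+
+**I. Preparation**
+
+**Image hosting server**
+
+I run a private image host, easyimage, on my own server. It accepts image uploads and returns a public URL, which is essential for the images in the Markdown files to display correctly on the blog.
+
+A public image hosting service such as Alibaba Cloud OSS also works, but it is not covered here.
+
+You need to edit `transfer_md/upload_img.py` by hand to configure the url, token, and related settings.
+
+Reference blog post: [【好玩儿的Docker项目】10分钟搭建一个简单图床——Easyimage-我不是咕咕鸽](https://blog.laoda.de/archives/docker-compose-install-easyimage)
+
+GitHub repository: [icret/EasyImages2.0: 简单图床 - 一款功能强大无数据库的图床 2.0版](https://github.com/icret/EasyImages2.0)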
+
+
+
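+**Installing PicGo:**
+
+With the Typora + PicGo + Easyimage combination, a local image pasted into a Markdown file is uploaded to the image host automatically.
+
+Download: [Releases · Molunerfinn/PicGo](https://github.com/Molunerfinn/PicGo/releases)
+
+Steps:
+
+1. Open PicGo and click the small window in the lower-right corner.
+2. Open the plugin settings, then search for and install the `web-uploader 1.1.1` plugin (note: older versions may fail to find it in search, so installing the latest version directly is recommended).
+3. Configure the plugin: fill in the API address in its settings. You can find the address in Easyimage under "Settings - API Settings".
+
+Once configured, images upload automatically, which makes editing Markdown smoother.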
+
+<img src="D:\folder\study\md_files\output\image-20250319180022461.png" alt="image-20250319180022461" style="zoom:67%;" />
+
+
+
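+**Typora settings**
+
+For images to display correctly on the blog, they **must be uploaded to the image host** while you edit the Markdown document, not saved locally. Configure Typora as follows:
+
+1. In Typora, open **File → Preferences → Image**.
+2. Under "When Insert", choose **Upload image**.
+3. Under "Image Upload Setting", choose **PicGo** and point it to the PicGo installation path.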
+
+![image-20250319175707761](D:\folder\study\md_files\output\image-20250319175707761.png)
+
+
+
+**Unified file structure**:
+
+```
+md_files
+├── category1
+│   ├── file1.md
+│   └── file2.md
+├── category2
+│   ├── file3.md
+│   └── file4.md
+└── output
+    ├── image1.png
+    ├── image2.jpg
+    └── ... (other image files)
+
+```
+
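+**Note**: each category folder maps to the category the post is filed under in Typecho.
+
+If your existing images are scattered across the filesystem, the `transfer_md/transfer.py` script can consolidate them. The script takes three parameters:
+
+- **input_path:** the root directory containing the Markdown files (`md_files` in the example above).
+- **output_path:** the target folder where processed images are collected (`output` in the example above).
+- **type_value**:
+  - `1`: scan every Markdown file under `input_path`, copy the local images they reference into `output_path`, and rewrite the image URLs in the Markdown files to point into `output_path`;
+  - `2`: create a separate folder for each Markdown file (named after the file) inside `output_path`, and store the Markdown file there with its images in an `assets` subdirectory;
+  - `3`: scan the Markdown files for local images, upload them to the image host to get public URLs, and replace the corresponding image URLs in the Markdown files with those public addresses.
+
+This project needs every image referenced by a public URL, i.e. `type_value=3`.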
+
+
+
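+**II. Version control with Git**
+
+Assuming you have already set up Gitea on your server (GitHub or Gitee work too) and created a repository named `md_files`, run the following steps in Git Bash inside the `md_files` folder to push the local files to the remote repository:
+
+**Initialize the local repository**: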
+
+```
+git init
+```
+
+**Add the remote repository**:
+
+Add the remote repository address as `origin` (replace `http://xxx` with your actual repository address):
+
+```
+git remote add origin http://xxx
+```
+
+**Stage and commit the files**:
+
+```
+git add .
+git commit -m "Initial commit"
+```
+
+**Push to the remote repository:**
+
+```
+git push -u origin master
+```
+
+**Subsequent updates:**
+
+```
+git add .
+git commit -m "Update xxx"
+git push
+```
+
+
+
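+**III. Deploying the script on the server**
+
+**1. Make sure the script can connect to the database Typecho uses**
+
+This blog deploys Typecho with docker-compose (see: [【好玩儿的Docker项目】10分钟搭建一个Typecho博客｜太破口！念念不忘，必有回响！-我不是咕咕鸽](https://blog.laoda.de/archives/docker-compose-install-typecho)). So that the script can reach Typecho's database, I deploy the Python application through docker-compose as well; all services then share one network and can talk to each other directly.
+
+A reference docker-compose.yml: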
+
+```
+services:
+  nginx:
+    image: nginx
+    ports:
+      - "4000:80"    # the left-hand port can be any unused port
+    restart: always
+    environment:
+      - TZ=Asia/Shanghai
+    volumes:
+      - ./typecho:/var/www/html
+      - ./nginx:/etc/nginx/conf.d
+      - ./logs:/var/log/nginx
+    depends_on:
+      - php
+    networks:
+      - web
+
+  php:
+    build: php
+    restart: always
+    expose:
+      - "9000"       # not exposed publicly, hence not 9000:9000
+    volumes:
+      - ./typecho:/var/www/html
+    environment:
+      - TZ=Asia/Shanghai
+    depends_on:
+      - mysql
+    networks:
+      - web
+  pyapp:
+    build: ./markdown_operation  # directory containing the Dockerfile
+    restart: "no"
+    networks:
+      - web
+    env_file:
+      - .env
+    depends_on:
+      - mysql
+  mysql:
+    image: mysql:5.7
+    restart: always
+    environment:
+      - TZ=Asia/Shanghai
+    expose:
+      - "3306"  # not exposed publicly, hence not 3306:3306
+    volumes:
+      - ./mysql/data:/var/lib/mysql
+      - ./mysql/logs:/var/log/mysql
+      - ./mysql/conf:/etc/mysql/conf.d
+    env_file:
+      - mysql.env
+    networks:
+      - web
+
+networks:
+  web:
+```
+
+Note: if Typecho is not deployed with Docker, it is enough that the script can connect to the database Typecho uses and modify its tables.
+
+
+
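+**2. Clone the version-controlled md_files repository into the markdown_operation directory**
+
+The container needs direct access to the md_files content, so the Git-managed md_files repository is cloned inside markdown_operation. The Markdown files are then easy to read and update, whether from the script or by other means.
+
+
+
+**3. Rebuild and start only the `pyapp` service, without affecting the other running services:**
+
+`pyapp` is the service name of this Python application in docker-compose.yml.
+
+Build the image: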
+
+```
+docker-compose build pyapp 
+```
+
+Start a container and open Bash:
+
+```
+docker-compose run --rm -it pyapp /bin/bash
+```
+
+Run the script inside the container:
+
+```
+python typecho_markdown_upload/main.py
+```
+
+At this point you can open the blog and check whether the posts were published.
+
+**If it failed, inspect the MySQL database:**
+
+1️⃣ Enter the MySQL container:
+
+```
+docker compose exec mysql mysql -uroot -p
+# enter your root password
+```
+
+2️⃣ Switch to the Typecho database and list its tables:
+
+```
+USE typecho;
+SHOW TABLES;
+```
+
+3️⃣ Inspect the structure of the `typecho_contents` table (the posts table):
+
+```
+DESCRIBE typecho_contents;
+SHOW CREATE TABLE typecho_contents\G
+```
+
+4️⃣ Count the current posts (to confirm whether the number changed after the run):
+
+```
+SELECT COUNT(*) AS cnt FROM typecho_contents;
+```
+
+

 ### TODO
+
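 - [x] Publish Markdown to Typecho
 - [x] Before publishing, upload the Markdown image resources to TencentCloud COS and replace the image links in the Markdown
 - [x] Use the name of the folder containing the md file as the post's category (publishing via MySQL can insert a category; the xmlrpc interface does not support category operations yet)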
--- a/transfer_md/transfer.py
+++ b/transfer_md/transfer.py
@@ -2,8 +2,14 @@ import os
 import re
 import shutil
 import uuid
+
+from dotenv import load_dotenv
+
 from transfer_md.upload_img import upload_image
 from transfer_md.download_img import download_image
+import sys
+# Load environment variables from the .env file
+load_dotenv()

 def extract_image_paths(content):
    """
@@ -13,6 +19,7 @@ def extract_image_paths(content):
    pattern_html = re.compile(r'<img\s+[^>]*src\s*=\s*"(.*?)"')
    return set(pattern_md.findall(content) + pattern_html.findall(content))

+
 def process_local_image_copy(abs_img_path, dest_folder):
    """
    Copy a local image to the destination folder and return the new file name (named with a UUID, extension preserved)
@@ -23,6 +30,7 @@ def process_local_image_copy(abs_img_path, dest_folder):
    shutil.copy2(abs_img_path, dest_path)
    return new_filename

+
 def process_md_file_local(md_file, output_path):
    """
    Process one Markdown file:
@@ -40,6 +48,7 @@ def process_md_file_local(md_file, output_path):

    # Get the directory containing the current md file
    md_dir = os.path.dirname(md_file)
+    abs_output_path = os.path.abspath(output_path)

    for img_path in img_paths:
        # Decide whether the image path is a local path or a network URL
@@ -56,6 +65,12 @@ def process_md_file_local(md_file, output_path):
                abs_img_path = img_path
            else:
                abs_img_path = os.path.normpath(os.path.join(md_dir, img_path))
+            abs_img_path = os.path.abspath(abs_img_path)
+
+            # If the image is already in the output directory, skip the copy
+            if abs_img_path.startswith(abs_output_path):
+                print(f"Skipping image already in the output folder: {abs_img_path}")
+                continue

            if os.path.exists(abs_img_path):
                if os.path.isfile(abs_img_path):  # make sure it is a file, not a folder
@@ -76,6 +91,7 @@ def process_md_file_local(md_file, output_path):
        f.write(content)
    print(f"Updated: {md_file}")

+
 def process_md_file_with_assets(md_file, output_base_path):
    """
    Process a single Markdown file: copy it to output_base_path/<md_name>/,
@@ -136,6 +152,7 @@ def process_md_file_with_assets(md_file, output_base_path):
        f.write(content)
    print(f"Updated: {target_md_path}")

+
 def process_md_file_remote(md_file):
    """
    Process one Markdown file:
@@ -193,6 +210,7 @@ def scan_files(base_folder, exclude_folders):
                md_files.append(os.path.join(root, file))
    return md_files

+
 def process_md_files(input_path, output_path, type, exclude_folders=None):
    """
    Process every Markdown file under the input directory and save the processed images to output_path.
@@ -214,9 +232,9 @@ def process_md_files(input_path, output_path, type, exclude_folders=None):
        if type == 1:
            process_md_file_local(md_file, output_path)  # rewrite URLs to local paths; images go to output_path
        elif type == 2:
-            process_md_file_with_assets(md_file, output_path)  #rewrite URLs to local paths; images and md both go to output_path
+            process_md_file_with_assets(md_file, output_path)  # rewrite URLs to local paths, assets layout; images and md files both go to output_path
        elif type == 3:
-            process_md_file_remote(md_file)    #rewrite URLs to public links
+            process_md_file_remote(md_file)  # rewrite image URLs to public links
        else:
            print(f"Unknown processing type: {type}")

@@ -224,7 +242,19 @@ def process_md_files(input_path, output_path, type, exclude_folders=None):


 if __name__ == "__main__":
-    type=1
-    input_path = r'D:\folder\study\md_files\Java\zbparse'
-    output_path = r'D:\folder\test\output'
-    process_md_files(input_path,output_path,type)
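+    # Read the type argument from the command line; default to 3 if it is not given
+    if len(sys.argv) > 1:
+        try:
+            type_value = int(sys.argv[1])
+        except ValueError:
+            print("The first argument must be an integer giving the processing type (1, 2 or 3)")
+            sys.exit(1)
+    else:
+        type_value = 3
+
+    # Adjust the input and output paths below to match your setup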
+    # input_path = os.getenv('BASE_FOLDER')
+    input_path=r'D:\folder\study\md_files'
+    # output_path = os.getenv('OUTPUT_FOLDER')
+    output_path=r'D:\folder\study\md_files\output'
+    process_md_files(input_path, output_path, type_value)
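
Review note on the `startswith` check added in `process_md_file_local`: a raw string prefix test also matches sibling folders, so an image under `...\output2\` would be treated as already inside `...\output`. A minimal sketch of a component-wise check follows, assuming the intent is "the image lies in or below the output folder". The helper name `is_inside_dir` is hypothetical and not part of this commit.

```python
import os


def is_inside_dir(path: str, directory: str) -> bool:
    """Return True if `path` is `directory` itself or lies anywhere beneath it."""
    abs_path = os.path.abspath(path)
    abs_dir = os.path.abspath(directory)
    try:
        # commonpath compares whole path components, not raw string prefixes.
        return os.path.commonpath([abs_path, abs_dir]) == abs_dir
    except ValueError:
        # Raised e.g. on Windows when the two paths are on different drives.
        return False
```

Under that assumption, the condition in the hunk could read `if is_inside_dir(abs_img_path, abs_output_path):` with the rest of the branch unchanged.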
--- a/transfer_md/upload_img.py
+++ b/transfer_md/upload_img.py
@@ -1,5 +1,9 @@
+import os
 import requests
+# Load environment variables from the .env file
+from dotenv import load_dotenv

+load_dotenv()
 def upload_image(img_path: str) -> str:
    """
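    Upload a local image to the easyimage host and return the image's public URL.
@@ -11,13 +15,13 @@ def upload_image(img_path: str) -> str:
      Public URL of the image on the image host

    API parameters:
-      - API address: http://124.71.159.195:1000/api/index.php
+      - API address: the API provided by the image host
      - POST parameter name for the image file: image
-      - Custom body parameter: {"token": "1a61048560d9a63430816f98ba5a4fb0"}
+      - Custom body parameter: {"token": "xxxxxx"}
      - Path of the image URL field in the response JSON: url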
    """
-    url = "https://pic.bitday.top/api/index.php"
-    token = "3b54c300cba118d185a4f9d2da9af513"
+    url = os.getenv('IMG_URL')
+    token = os.getenv('IMG_TOKEN')

    try:
        with open(img_path, "rb") as f:
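
Review note on the switch to `os.getenv('IMG_URL')` and `os.getenv('IMG_TOKEN')`: `os.getenv` returns `None` when a variable is unset, so a missing `.env` entry would only surface later, inside the upload request. A minimal sketch of an early check follows. The helper name `require_env` is hypothetical and not part of this commit.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

With such a helper, `upload_image` could call `require_env('IMG_URL')` and `require_env('IMG_TOKEN')` so that a misconfigured environment fails with a message naming the missing variable.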
--- a/typecho_markdown_upload/typecho_direct_mysql_publisher.py
+++ b/typecho_markdown_upload/typecho_direct_mysql_publisher.py
@@ -1,10 +1,7 @@
-#typecho_direct_mysql_publisher.py
 import pymysql
 import time
-
 from pymysql.converters import escape_string

-
 class TypechoDirectMysqlPublisher:
    def __init__(self, host, port, user, password, database, table_prefix):
        self.__table_prefix = table_prefix
@@ -22,8 +19,11 @@ class TypechoDirectMysqlPublisher:
        self.__init_categories()

    def __init_categories(self):
+        """
+        Load the category list into self.__exist_categories
+        """
        cursor = self.__db.cursor()
-        sql = "select mid,name from %s where type='%s'" % (self.__categories_table_name, 'category')
+        sql = "SELECT mid, name FROM %s WHERE type='category'" % self.__categories_table_name
        cursor.execute(sql)
        results = cursor.fetchall()
        self.__exist_categories = []
@@ -34,55 +34,121 @@ class TypechoDirectMysqlPublisher:
            })

    def __get_category_id(self, category_name):
-        if len(self.__exist_categories) > 0:
+        """
+        Look up the matching category ID in self.__exist_categories
+        """
        for item in self.__exist_categories:
            if item['name'] == category_name:
                return item['mid']
        return -1

    def __add_category(self, category_name):
+        """
+        Insert a new category when it does not exist yet
+        """
        cursor = self.__db.cursor()
-        sql = "INSERT INTO %s " \
-              "(`name`, `slug`, `type`, `description`, `count`, `order`, `parent`) " \
-              "VALUES " \
-              "('%s', '%s', 'category', '', 0, 1, 0)" % (self.__categories_table_name, category_name, category_name)
+        sql = (
+            "INSERT INTO %s "
+            "(`name`, `slug`, `type`, `description`, `count`, `order`, `parent`) "
+            "VALUES "
+            "('%s', '%s', 'category', '', 0, 1, 0)"
+        ) % (self.__categories_table_name, category_name, category_name)
        cursor.execute(sql)
        mid = cursor.lastrowid
        self.__db.commit()
+
+        # Reload the category cache to avoid inserting the category again
        self.__init_categories()
        return mid

    def __insert_relationship(self, cursor, cid, mid):
-        insert_relationship_sql = "INSERT INTO %s" \
-                                  "(`cid`, `mid`) " \
-                                  "VALUES " \
-                                  "(%d, %d)" % (self.__relationships_table_name, cid, mid)
+        """
+        Insert the post-to-category link into typecho_relationships
+        """
+        insert_relationship_sql = (
+            "INSERT INTO %s "
+            "(`cid`, `mid`) "
+            "VALUES "
+            "(%d, %d)"
+        ) % (self.__relationships_table_name, cid, mid)
        cursor.execute(insert_relationship_sql)

    def __update_category_count(self, cursor, mid):
-        update_category_count_sql = "UPDATE %s SET `count`=`count`+1 WHERE mid=%d" % (self.__categories_table_name, mid)
+        """
+        Increment the category's post count by 1
+        """
+        update_category_count_sql = (
+            "UPDATE %s SET `count`=`count`+1 WHERE mid=%d"
+        ) % (self.__categories_table_name, mid)
        cursor.execute(update_category_count_sql)

    def publish_post(self, title, content, category):
-        content = '<!--markdown-->' + content
+        """
+        If a post with the same title already exists in the same category, return the existing cid directly;
+        otherwise insert a new post and return the new cid.
+        """
+        cursor = self.__db.cursor()
+
+        # 1. Get the category ID (insert the category if it does not exist)
        mid = self.__get_category_id(category)
        if mid < 0:
            mid = self.__add_category(category)

+        # 2. Duplicate check: does the same title already exist under this category (mid)?
+        #    Determined by joining the contents & relationships tables
+        check_sql = """
+            SELECT c.cid
+            FROM %s c
+            JOIN %s r ON c.cid = r.cid
+            WHERE c.title = '%s'
+              AND r.mid = %d
+            LIMIT 1
+        """ % (
+            self.__contents_table_name,
+            self.__relationships_table_name,
+            escape_string(title),
+            mid
+        )
+        cursor.execute(check_sql)
+        exist_row = cursor.fetchone()
+        if exist_row:
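+            # A post with this title already exists; return it directly
+            print(f"[INFO] Found an existing post with the same title in this category: {title}, cid={exist_row[0]}; skipping insert.")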
+            return exist_row[0]
+
+        # 3. Insert the new post
        now_time_int = int(time.time())
-        cursor = self.__db.cursor()
-        sql = "INSERT INTO %s " \
-              "(`title`, `slug`, `created`, `modified`, `text`, `order`, `authorId`, `template`, `type`, `status`, `password`, `commentsNum`, `allowComment`, `allowPing`, `allowFeed`, `parent`) " \
-              "VALUES " \
-              "('%s', NULL , %d, %d, '%s', 0, 1, NULL, 'post', 'publish', NULL, 0, '1', '1', '1', 0)" \
-              "" % (self.__contents_table_name, escape_string(title), now_time_int, now_time_int, escape_string(content))
-        cursor.execute(sql)
+        content = '<!--markdown-->' + content
+
+        insert_sql = (
+            "INSERT INTO %s "
+            "(`title`, `slug`, `created`, `modified`, `text`, `order`, `authorId`, `template`, `type`, `status`, `password`, `commentsNum`, `allowComment`, `allowPing`, `allowFeed`, `parent`) "
+            "VALUES "
+            "('%s', NULL, %d, %d, '%s', 0, 1, NULL, 'post', 'publish', NULL, 0, '1', '1', '1', 0)"
+        ) % (
+            self.__contents_table_name,
+            escape_string(title),
+            now_time_int,
+            now_time_int,
+            escape_string(content)
+        )
+        cursor.execute(insert_sql)
        cid = cursor.lastrowid
-        update_slug_sql = "UPDATE %s SET slug=%d WHERE cid=%d" % (self.__contents_table_name, cid, cid)
+
+        # 4. Set slug = cid
+        update_slug_sql = (
+            "UPDATE %s SET slug=%d WHERE cid=%d"
+        ) % (self.__contents_table_name, cid, cid)
        cursor.execute(update_slug_sql)

-        self.__insert_relationship(cursor, cid=cid, mid=mid)
+        # 5. Link the post to its category
+        self.__insert_relationship(cursor, cid, mid)
+
+        # 6. Increment the category's post count
        self.__update_category_count(cursor, mid)

+        # 7. Commit
        self.__db.commit()
+
+        print(f"[INFO] New post inserted: title={title}, cid={cid}, category={category}")
        return cid
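
Review note on the SQL in this file: every statement is assembled with `%` string formatting, and `__add_category` interpolates `category_name` without `escape_string`, so a folder name containing a quote would break the statement. A minimal sketch of the same check-then-insert flow with bound parameters follows. To keep it runnable on its own it uses the standard-library `sqlite3` module and simplified, hypothetical table definitions (`contents`, `relationships`) instead of the real Typecho schema; with `pymysql` the placeholder would be `%s` rather than `?`.

```python
import sqlite3
import time


def publish_post(db, title, content, mid):
    """Insert a post under category `mid` unless one with the same title exists there."""
    cur = db.cursor()
    # Values are bound by the driver, so quotes in `title` need no manual escaping.
    cur.execute(
        "SELECT c.cid FROM contents c "
        "JOIN relationships r ON c.cid = r.cid "
        "WHERE c.title = ? AND r.mid = ? LIMIT 1",
        (title, mid),
    )
    row = cur.fetchone()
    if row:
        return row[0]  # duplicate within this category: reuse the existing cid

    now = int(time.time())
    cur.execute(
        "INSERT INTO contents (title, created, modified, text) VALUES (?, ?, ?, ?)",
        (title, now, now, "<!--markdown-->" + content),
    )
    cid = cur.lastrowid
    cur.execute("INSERT INTO relationships (cid, mid) VALUES (?, ?)", (cid, mid))
    db.commit()
    return cid


# Stand-in schema for the demonstration only.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE contents (cid INTEGER PRIMARY KEY AUTOINCREMENT, "
    "title TEXT, created INTEGER, modified INTEGER, text TEXT)"
)
db.execute("CREATE TABLE relationships (cid INTEGER, mid INTEGER)")

first = publish_post(db, "It's a title", "body", mid=1)
again = publish_post(db, "It's a title", "edited body", mid=1)
other = publish_post(db, "It's a title", "body", mid=2)
print(first == again, first != other)  # prints "True True"
```

As in the diff, a second call with the same title and category returns the existing cid and leaves the stored text untouched, so an edited file with an unchanged title is skipped rather than updated.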