2.7 Adding comments

README.md

Project home: [zy123/zbparse - zbparse - 智标领航 code repository](http://47.98.59.178:3000/zy123/zbparse)

git clone URL: http://47.98.59.178:3000/zy123/zbparse.git

Work on the develop branch; among the develop-xx branches, a later xx means a newer branch.

Production environment: 121.41.119.164:5000

Test environment: 47.98.58.178:5000

"Full parse" (大解析): the flow entered from the tender-file parsing entry point; handled by upload.py.

"Small parse" (小解析): the flow entered from the bid-file generation entry point; the backend calls the two endpoints little_zbparse and get_deviation together.

## Project structure


.env holds secrets (LLM keys, textin keys, etc.). It is ignored via .gitignore, so a `git pull` on the server never touches it (the keys are sensitive); the .env in the server's project directory must be maintained by hand.

**How to update the version on the server:**

1. Enter the project directory.

   

   **Note:** confirm that .env exists on the server; it is hidden by default.

   Run `cat .env` to check.

   If it does not exist, run `sudo vim .env` in the project directory and paste the secrets in!

2. `git pull`

3. `sudo docker-compose up --build -d` rebuilds the image and restarts in one step.

   Alternatively, run `sudo docker-compose build` first to build the image, then `sudo docker-compose up -d` to restart when the service is idle.

4. `sudo docker-compose logs flask_app --since 1h` shows the last hour of logs (it also works when the restart failed, so running it after every restart is recommended).

requirements.txt rarely needs changes; only when the code starts using a new library must its package name and version be added there by hand.
**How to run the project locally:**

1. Install the environment from requirements.txt.
2. Set up .env (normally no extra OS-level environment variables are needed).
3. Open the run-configuration dropdown, choose Edit configurations,

   

   and set run_serve.py as the startup script.

4. Test with a POST request from Postman:

   http://127.0.0.1:5000/upload

   body:

   {
       "file_url": "xxxx",
       "zb_type": 2
   }
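As a sketch, the Postman request above can be reproduced in Python. The URL and body fields come from the README; the helper names are mine, and the actual call requires the service to be running locally:

```python
import json

# Endpoint from the README; zb_type=2 selects goods-tender parsing.
UPLOAD_URL = "http://127.0.0.1:5000/upload"

def build_upload_payload(file_url: str, zb_type: int = 2) -> dict:
    """Builds the JSON body expected by /upload."""
    return {"file_url": file_url, "zb_type": zb_type}

def post_upload(file_url: str, zb_type: int = 2):
    """Sends the request; the endpoint streams its results back piece by piece."""
    import requests  # third-party; only needed for the actual call
    return requests.post(UPLOAD_URL,
                         json=build_upload_payload(file_url, zb_type),
                         stream=True)

if __name__ == "__main__":
    print(json.dumps(build_upload_payload("https://example.com/tender.pdf"),
                     ensure_ascii=False))
```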
## flask_app structure

### general

The folder for shared helper functions. llm/ contains the various LLM clients; 读取文件/ contains the docx/pdf readers plus the document cleaner clean_pdf, which strips headers, footers, and page numbers.



general/llm/清除file_id.py must be run **at least once a week** to keep the file_id count from exceeding the quota (on my side every request cleans up its file_ids when it finishes; 向's side probably has not added this yet).

llm/model_continue_query is the "model, please continue" script: when a very long answer cannot be produced in one model call, it keeps asking the model to continue and stitches the pieces into a complete answer.
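The continue-and-stitch idea can be sketched as follows. This is a minimal illustration, not the project's actual script: `ask` stands in for the real model call, and the finish-reason convention is an assumption:

```python
def query_until_complete(ask, prompt, max_rounds=5):
    """Keeps asking the model to continue while it stops because of the
    output-token limit, then concatenates all pieces.

    `ask(prompt)` must return (text, finish_reason), where finish_reason
    is "length" when the model was cut off mid-answer.
    """
    text, reason = ask(prompt)
    parts = [text]
    rounds = 0
    while reason == "length" and rounds < max_rounds:
        # The real script presumably re-sends conversation context;
        # here we only send a bare continue instruction.
        text, reason = ask("请接着上文继续输出,不要重复已有内容。")
        parts.append(text)
        rounds += 1
    return "".join(parts)
```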
general/file2markdown converts files to markdown via textin.

general/format_change converts pdf -> docx, and doc/docx -> pdf.

general/merge_pdfs.py concatenates files: 1. tender notice + instructions to bidders; 2. evaluation-rules chapter + qualification-review chapter.

**Important parts of general!!!**

**Post-processing:**

general/**post_processing** is the post-processing stage after parsing; extract_info, the qualification review, the technical deviations, the business deviations, and the required supporting documents are all generated here.

**inner_post_processing** inside post_processing extracts *extracted_info* specifically.

**process_functions_in_parallel** inside post_processing extracts:

the qualification review, technical deviations, business deviations, and required supporting documents.



The full parse (upload) uses the complete post_processing;

little_zbparse.py and 小解析main.py use inner_post_processing;

get_deviation.py and 偏离表数据解析main.py use process_functions_in_parallel.
**PDF truncation:**

*截取pdf_main.py* is the top-level entry;

one level down are *截取pdf货物标版.py* and *截取pdf工程标版.py* (not under general);

one level further down is *截取pdf通用函数.py*.

**Shared invalid-bid / void-bid code (无效标和废标公共代码)**

The main executable code for extracting invalid-bid and void-bid clauses. The pipeline: preprocess the docx file => regex matching => temp.txt => LLM filtering.

If the extraction is incomplete, either the regexes do not cover the wording, or the LLM prompt missed an item.

Note: if a paragraph matches both the regexes and any one of the follow_up_keywords, it is not added to temp (so it skips LLM filtering); instead it is **added directly** to the final result!
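The routing rule just described can be sketched like this. The pattern and keyword lists here are stand-ins, not the project's real regexes:

```python
import re

# Assumed patterns for illustration only; the real lists are larger.
INVALID_PATTERNS = [re.compile(r"废标"), re.compile(r"无效")]
FOLLOW_UP_KEYWORDS = [re.compile(r"有下列情形之一")]

def route_paragraphs(paragraphs):
    """Splits regex-matched paragraphs into direct results vs. LLM candidates."""
    direct, for_llm = [], []
    for p in paragraphs:
        if not any(rx.search(p) for rx in INVALID_PATTERNS):
            continue  # no regex hit: the paragraph is dropped entirely
        if any(kw.search(p) for kw in FOLLOW_UP_KEYWORDS):
            direct.append(p)   # goes straight into the final result
        else:
            for_llm.append(p)  # written to temp.txt for LLM filtering
    return direct, for_llm
```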


**Extracting the instructions-to-bidders body clauses into a JSON file**

Converts the truncated ztbfile_tobidders_notice_part2.pdf (the instructions body) into clause1.json, which later feeds the extraction of the **bid opening/evaluation/award process**, the **bid document requirements**, and **re-tendering, no further tendering, and termination of tendering**.

The core logic first matches top-level chapters of the form '一、总则',

then numbered headings like '1.1' and '1.1.1'. Since the pdf is read line by line, the content after one number may span several lines, so everything up to the line starting the next number (e.g. '2.1') is attributed to the previous number.
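A simplified sketch of that accumulation logic (the real parser handles many more edge cases, e.g. guarding against '2025年xxx' being read as a heading number):

```python
import re

NUM_RE = re.compile(r"^(\d+(?:\.\d+)*)\s*(.*)")

def parse_clauses(lines):
    """Groups lines under the most recent numbered heading."""
    clauses, current_key, buf = {}, None, []
    for line in lines:
        m = NUM_RE.match(line.strip())
        if m:  # a new numbered heading closes the previous clause
            if current_key is not None:
                clauses[current_key] = " ".join(buf).strip()
            current_key, buf = m.group(1), [m.group(2)]
        elif current_key is not None:
            buf.append(line.strip())  # continuation line of the current clause
    if current_key is not None:
        clauses[current_key] = " ".join(buf).strip()
    return clauses
```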
### old_version

Deprecated code that is not used in the production or test environments; it can be ignored.



### routes

The endpoints and their main implementations, one to one:



get_deviation corresponds to 偏离表数据解析main and returns the deviation-table data;

judge_zbfile corresponds to deciding whether a file is a tender document;

little_zbparse corresponds to 小解析main and produces extract_info;

test_zbparse is a test endpoint with no counterpart;

upload corresponds to engineering-tender and goods-tender parsing, i.e. the full parse.

**Disambiguation**: "small parse" can mean a process, namely the parse entered from the '投标文件生成' (bid-file generation) entry, where the backend calls little_zbparse and get_deviation together; that whole process is called the small parse. But little_zbparse by itself is also called the small parse: it got the name because originally only that data (extract_info) had to be returned, and the business and technical deviations were added later.

utils holds the shared helpers for the routes. Its validate_and_setup_logger function maps each endpoint request to its own output folder, e.g. upload -> output1; when new endpoints are added, their mapping can simply be written here as well.



For the full parse, focus on **upload.py** and **货物标解析main.py**.
### static

Holds the parsing output and the prompts.

output is gitignored, so `git push` does not upload it.

Each subfolder (output1, output2, ...) corresponds to a different endpoint request.



### test_case & testdir

test_case contains test cases for individual functions; it has not been updated in a long while.

testdir is the scratch space for everyday testing while coding.

Neither affects parsing in the production or test environments.


### 工程标 & 货物标

These folders hold what differs between the two parsing flows (everything shared lives in **general**).



The main difference is that the goods-tender flow additionally parses the procurement requirements (提取采购需求main + 技术参数要求提取 + 商务服务其他要求提取).

### Finally:

ConnectionLimiter.py defines the per-endpoint timeout; after the timeout, the connection to the backend is closed.
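A decorator in this spirit might look like the following. This is purely illustrative, not ConnectionLimiter's actual implementation (which must also close the streaming connection):

```python
import concurrent.futures
import functools

def require_timeout(seconds):
    """Runs the wrapped function in a worker thread and gives up after `seconds`."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            fut = ex.submit(fn, *args, **kwargs)
            try:
                return fut.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                # A thread cannot be killed; the real code presumably
                # drops the connection and lets the worker finish quietly.
                return {"status": "error", "message": "timeout"}
            finally:
                ex.shutdown(wait=False)
        return wrapper
    return deco
```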


logger_setup.py creates a separate log per request; each log corresponds to one log.txt.

start_up.py is the startup script; run_serve is also a startup script, a thin wrapper around start_up.py, and the Dockerfile currently launches via run_serve directly.

## Ongoing concerns

```
yield sse_format(tech_deviation_response)
yield sse_format(tech_deviation_star_response)
yield sse_format(zigefuhe_deviation_response)
yield sse_format(shangwu_deviation_response)
yield sse_format(shangwu_star_deviation_response)
yield sse_format(proof_materials_response)
```
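sse_format, used in the block above, is presumably a small helper that wraps each response as a Server-Sent-Events frame. A minimal sketch of the idea (not the project's actual implementation):

```python
import json

def sse_format(response: dict) -> str:
    """Wraps a response dict as one SSE frame: a 'data:' line plus a blank line."""
    return f"data: {json.dumps(response, ensure_ascii=False)}\n\n"
```

Each yielded frame then reaches the backend (and, from there, the frontend) as one event in the stream.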
1. Engineering-tender parsing still does not parse the procurement requirements, so its post-processing only returns '资格审查', '证明材料', and 'extracted_info'; it has neither '商务偏离'/'商务带星偏离' nor '技术偏离'/'技术带星偏离'. Goods-tender parsing is the complete version.

   Of these, '证明材料' and 'extracted_info' are returned directly to the backend for storage.

2. The full parse returns the technical score; the backend not only shows it to the frontend but also passes it to 向 for generating the technical deviation table.

3. In the small parse, get_deviation.py could also return the technical score, but it currently does not: nobody was coordinating with me on it, so it is commented out for now.

   

4. The business-review and technical-review deviation tables (the deviation tables for the scoring rules) have not been built yet, but **商务评分 and 技术评分** are parsed in both the full and the small parse; lightly reshaping that data and returning it to the backend would be enough.

   

   This parsed result is fine for frontend display, but generating the business/technical review deviation tables would require one more LLM call to re-summarize the data, ideally into a list of strings, before sending it to the backend. (Not done yet.)
@ -266,7 +266,8 @@ def outer_post_processing(combined_data, includes, good_list):
    if "基础信息" in includes:
        base_info = combined_data.get("基础信息", {})
        # call the inner inner_post_processing on '基础信息'
-       extracted_info = inner_post_processing(base_info)
+       extracted_info = inner_post_processing(base_info)  # builds extract_info, which is returned to the backend

        # keep '基础信息' in the processed data
        processed_data["基础信息"] = base_info
    # extract '采购需求' under '采购要求'
@ -291,7 +292,7 @@ def outer_post_processing(combined_data, includes, good_list):
    busi_eval = combined_data.get("商务评分", {})
    busi_eval_info = json.dumps(busi_eval, ensure_ascii=False, indent=4)
    all_data_info = '\n'.join([zige_info, fuhe_info, zigefuhe_info, tech_deviation_info, busi_requirements_info, tech_eval_info, busi_eval_info])
-   tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation, proof_materials = process_functions_in_parallel(
+   tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation, proof_materials = process_functions_in_parallel(  # the main function generating the technical/business deviations
        tech_requirements_dict=tech_requirements,
        busi_requirements_dict=busi_requirements,
        zige_info=zige_info,
@ -387,8 +387,8 @@ def extract_from_notice(merged_baseinfo_path, clause_path, type):

    # map type to target_values
    type_target_map = {
-       1: ["投标", "投标文件", "响应文件"],
-       2: ["开标", "评标", "定标", "评审", "成交", "合同", "磋商", "谈判", "中标", "程序", "步骤"],
+       1: ["投标", "投标文件", "响应文件"],  # bid document requirements
+       2: ["开标", "评标", "定标", "评审", "成交", "合同", "磋商", "谈判", "中标", "程序", "步骤"],  # bid opening/evaluation/award process
        3: ["重新招标、不再招标和终止招标", "重新招标", "重新采购", "不再招标", "不再采购", "终止招标", "终止采购"],
        4: ["评标"]  # test
    }
@ -10,11 +10,10 @@ from flask_app.general.截取pdf通用函数 import clean_page_content, extract_
def compare_headings(current, new):
    """
-   比较两个标题的层次关系,并确保新标题比当前标题大且最高位数字差值不超过5。
+   Compares the hierarchy of two headings, ensuring the new heading is larger than the current one and the difference in the leading number stays within a small bound.
+   Rationale: a new heading number is expected to be larger than the previous one, and the bound also guards against very large numbers at the start of a line, such as '2025年xxx', where '2025' would otherwise be mis-matched as a heading number even though it is only body text.

    Args:
        current (str): the current heading, e.g. "1.2.3"
        new (str): the new heading, e.g. "1.3"

    Returns:
        bool: True if the new heading is larger than the current one and the leading-number difference does not exceed 3, otherwise False
    """
@ -80,7 +79,7 @@ def parse_text_by_heading(text):
    lines = text.split('\n')

    def get_current_number(key_chinese):
-       chinese_to_number = {
+       chinese_to_number = {  # makes Chinese ordinals comparable by size
            '一': 1, '二': 2, '三': 3, '四': 4, '五': 5,
            '六': 6, '七': 7, '八': 8, '九': 9, '十': 10,
            '十一': 11, '十二': 12, '十三': 13, '十四': 14, '十五': 15
@ -99,7 +98,7 @@ def parse_text_by_heading(text):
    patterns = [
        r'^(?<![a-zA-Z((])(\d+(?:\.\d+)+)\s*(.*)',  # matches '12.1 content'
        r'^(\d+\.)\s*(.+)$',  # matches '12. content'
-       r'^[..](\d+(?:[..]\d+)*)\s*(.+)$',  # matches '.12.1 content'
+       r'^[..](\d+(?:[..]\d+)*)\s*(.+)$',  # matches '.12.1 content'; the leading dot appears because clean_page_content, run while reading the pdf, strips headers/footers/page numbers and can mistake part of a heading number for a page number and delete it; this case is already handled reasonably well in the code
        r'^(\d+)([^.\d].*)'  # matches '27 content'
    ]
    for pattern in patterns:
@ -111,9 +110,9 @@ def parse_text_by_heading(text):
    first_five_lines = lines[:5]
    has_initial_heading_patterns = False
    for line in first_five_lines:
-       line_stripped = line.strip().replace('.', '.')
-       if line_stripped.startswith("##"):
-           line_stripped = line_stripped[2:]  # Remove "##"
+       line_stripped = line.strip().replace('.', '.')  # line_stripped is the line currently being processed
+       if line_stripped.startswith("##"):  # the '##' marker is added during preprocessing before the first line of every pdf page so it can be treated specially: the first line's heading number is often wrongly deleted by clean_page_content and must be restored
+           line_stripped = line_stripped[2:]  # remove "##"
        if (pattern_numbered.match(line_stripped) or pattern_parentheses.match(
                line_stripped) or pattern_letter_initial.match(line_stripped)):
            has_initial_heading_patterns = True
@ -135,14 +134,14 @@ def parse_text_by_heading(text):
    if not match:
        match = re.match(r'^(\d+\.)\s*(.+)$', line_stripped)

-   # check whether we enter or leave a special section
+   # Check whether we enter or leave a special section. This prevents lines such as '一、招标公告', '二、投标人须知', '七、 xxxx' that appear a few lines after e.g. '5、 竞争性磋商采购文件的构成' from being wrongly matched as top-level headings; they should be treated as body text under '5、 竞争性磋商采购文件的构成'.
+   # While inside a special section, every processed line is treated as content of the current number rather than as a new heading.
    if is_heading(line_stripped):
        if any(re.search(pattern, line_stripped) for pattern in special_section_keywords):
            in_special_section = True
        elif in_special_section:
            in_special_section = False

    # the original matching logic follows
    # match a line starting with a dot followed by numbers, e.g. '.12.1 content'
    dot_match = re.match(r'^[..](\d+(?:[..]\d+)*)\s*(.+)$', line_stripped)

@ -283,8 +282,9 @@ def parse_text_by_heading(text):
                                             in_special_section)

    else:
-       # decide from the pre-set flag whether to run this part
+       # Nothing above matched, so the current line is either a top-level heading like '一、总则' or plain body text such as '应通知采购代理机构补全或更换,否则风险自负。'
        if has_initial_heading_patterns and not skip_subheadings and not in_special_section:
+           # Matched one of the top-level heading forms ('一、xx', '(一)、xx', 'A.xx') and not inside a special section: continue with the code below.
            numbered_match = pattern_numbered.match(line_stripped)  # 一、
            parentheses_match = pattern_parentheses.match(line_stripped)  # (一)
            if i < 5:
@ -368,7 +368,7 @@ def parse_text_by_heading(text):
        append_newline = handle_content_append(current_content, line_stripped, append_newline, keywords,
                                               in_special_section)
    else:
-       # inside a special section, everything becomes content of the current heading
+       # in this case everything is treated as content of the current heading and appended to current_content
        if line_stripped:
            append_newline = handle_content_append(current_content, line_stripped, append_newline, keywords,
                                                   in_special_section)
@ -74,8 +74,11 @@ def clean_dict_datas(extracted_contents, keywords, excludes):  # 让正则表达

    return all_text1, all_text2  # all_text1 additionally goes through GPT; all_text2 is returned directly

+# handles paragraphs that span a page break

def preprocess_paragraphs(elements):
+   '''
+   Handles paragraphs that span a page break: program logic decides whether two paragraphs can be merged into one.
+   '''
    processed = []  # the list of processed paragraphs
    index = 0
    flag = False  # the flag bit
@ -437,8 +440,10 @@ def split_cell_text(text):
    # print(split_sentences)
    return split_sentences

-# file preprocessing: extract text and tables in file order and merge cross-page tables
def extract_file_elements(file_path):
+   '''
+   File preprocessing: extracts text and tables in file order, merging tables that span pages.
+   '''
    doc = Document(file_path)
    doc_elements = doc.element.body
    doc_paragraphs = doc.paragraphs
@ -71,4 +71,4 @@ def create_logger(app, subfolder):
    logger.setLevel(logging.INFO)
    logger.propagate = False
    g.logger = logger
-   g.output_folder = output_folder
+   g.output_folder = output_folder  # path of the output folder
@ -11,7 +11,7 @@ from flask_app.工程标.无效标和废标和禁止投标整合 import combine_
from flask_app.工程标.投标人须知正文提取指定内容工程标 import extract_from_notice
import concurrent.futures
from flask_app.工程标.基础信息整合工程标 import combine_basic_info
-from flask_app.工程标.资格审查模块 import combine_review_standards
+from flask_app.工程标.资格审查模块main import combine_review_standards
from flask_app.old_version.商务评分技术评分整合old_version import combine_evaluation_standards
from flask_app.general.format_change import pdf2docx, docx2pdf, doc2docx
from flask_app.general.docx截取docx import copy_docx
@ -13,7 +13,7 @@ get_deviation_bp = Blueprint('get_deviation', __name__)
@get_deviation_bp.route('/get_deviation', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=720)
-def get_deviation():
+def get_deviation():  # serves the business/technical deviation data
    logger = g.logger
    unique_id = g.unique_id
    file_url = g.file_url

@ -16,7 +16,7 @@ class JudgeResult(Enum):
@judge_zbfile_bp.route('/judge_zbfile', methods=['POST'])
@validate_and_setup_logger
# @require_connection_limit(timeout=30)
-def judge_zbfile() -> Any:
+def judge_zbfile() -> Any:  # decides whether the file is a tender document
    """
    Main function; calls wrapper and sets the timeout for the whole endpoint. Returns a default value on timeout.
    """

@ -13,7 +13,7 @@ little_zbparse_bp = Blueprint('little_zbparse', __name__)
@little_zbparse_bp.route('/little_zbparse', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=300)
-def little_zbparse():
+def little_zbparse():  # small parse
    logger = g.logger
    file_url = g.file_url
    zb_type = g.zb_type

@ -15,7 +15,7 @@ upload_bp = Blueprint('upload', __name__)
@upload_bp.route('/upload', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=720)
-def zbparse():
+def zbparse():  # full parse
    logger = g.logger
    try:
        logger.info("大解析开始!!!")
@ -25,7 +25,7 @@ def zbparse():
    zb_type = g.zb_type
    try:
        logger.info("starting parsing url:" + file_url)
-       return process_and_stream(file_url, zb_type)
+       return process_and_stream(file_url, zb_type)  # the main worker function
    except Exception as e:
        logger.error('Exception occurred: ' + str(e))
        if hasattr(g, 'unique_id'):
@ -89,12 +89,12 @@ def process_and_stream(file_url, zb_type):
    good_list = None

    processing_functions = {
-       1: engineering_bid_main,
-       2: goods_bid_main
+       1: engineering_bid_main,  # engineering-tender parsing
+       2: goods_bid_main  # goods-tender / service-tender parsing
    }
    processing_func = processing_functions.get(zb_type, goods_bid_main)

-   for data in processing_func(output_folder, downloaded_filepath, file_type, unique_id):
+   for data in processing_func(output_folder, downloaded_filepath, file_type, unique_id):  # receives the goods/engineering parsing results one by one, for frontend display
        if not data.strip():
            logger.error("Received empty data, skipping JSON parsing.")
            continue
@ -117,7 +117,7 @@ def process_and_stream(file_url, zb_type):
            yield sse_format(error_response)
            return  # stop further processing

-       if 'good_list' in parsed_data:
+       if 'good_list' in parsed_data:  # the goods list
            good_list = parsed_data['good_list']
            logger.info("Collected good_list from the processing function: %s", good_list)
            continue
@ -131,20 +131,22 @@ def process_and_stream(file_url, zb_type):
            status='success',
            data=data
        )
-       yield sse_format(response)
+       yield sse_format(response)  # sent to the backend -> shown on the frontend

    base_end_time = time.time()
    logger.info(f"分段解析完成,耗时:{base_end_time - start_time:.2f} 秒")

+   # At this point the frontend has received all parsed content; everything below is unrelated to frontend display and is post-processing: 1. extracted_result, key-information storage 2. the technical deviation table 3. the business deviation table 4. the supporting documents the bidder must submit (currently stored by the backend, not yet shown on the frontend)
+   # post-processing starts here!!!
    output_json_path = os.path.join(output_folder, 'final_result.json')
    extracted_info_path = os.path.join(output_folder, 'extracted_result.json')
    includes = ["基础信息", "资格审查", "商务评分", "技术评分", "无效标与废标项", "投标文件要求", "开评定标流程"]
    final_result, extracted_info, tech_deviation, tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation, proof_materials = outer_post_processing(
-       combined_data, includes, good_list)
+       combined_data, includes, good_list)  # post-processing: builds extracted_info, the business/technical deviation data, and the supporting materials returned to the backend

+   # post-processing done! The rest only builds and returns responses without further changing the data.
    tech_deviation_response, tech_deviation_star_response, zigefuhe_deviation_response, shangwu_deviation_response, shangwu_star_deviation_response, proof_materials_response = generate_deviation_response(
        tech_deviation, tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation,
-       proof_materials, logger)
+       proof_materials, logger)  # builds the standardized responses

    # use the generic response helper
    yield sse_format(tech_deviation_response)
@ -184,7 +186,7 @@ def process_and_stream(file_url, zb_type):
    )
    yield sse_format(complete_response)

-   final_response = create_response(
+   final_response = create_response(  # the backend currently terminates the connection when it reads 'END' in 'data'
        message='文件上传并处理成功',
        status='success',
        data='END'
@ -16,7 +16,7 @@ from flask_app.general.无效标和废标公共代码 import combine_find_invali
from flask_app.general.投标人须知正文提取指定内容 import extract_from_notice
import concurrent.futures
from flask_app.工程标.基础信息整合工程标 import combine_basic_info
-from flask_app.工程标.资格审查模块 import combine_review_standards
+from flask_app.工程标.资格审查模块main import combine_review_standards
from flask_app.general.商务技术评分提取 import combine_evaluation_standards
from flask_app.general.format_change import pdf2docx, docx2pdf

@ -34,9 +34,8 @@ def preprocess_files(output_folder, file_path, file_type, logger):

    # call the PDF truncation several times
    truncate_files = truncate_pdf_multiple(pdf_path, output_folder, logger, 'goods')  # index: 0->商务技术服务要求 1->评标办法 2->资格审查 3->投标人须知前附表 4->投标人须知正文

-   # handle the individual parts
-   invalid_path = truncate_files[6] if truncate_files[6] != "" else pdf_path  # 无效标
+   invalid_path = truncate_files[6] if truncate_files[6] != "" else pdf_path  # 无效标 (the content before the 投标文件格式 section / the contract clauses)

    invalid_added_pdf = insert_mark(invalid_path)
    invalid_added_docx = pdf2docx(invalid_added_pdf)  # the marked invalid_path
@ -141,6 +140,7 @@ def fetch_invalid_requirements(invalid_added_docx, output_folder, logger):
        result = {"无效标与废标": {}}
    return result

+# bid document requirements
def fetch_bidding_documents_requirements(invalid_deleted_docx, merged_baseinfo_path, clause_path, logger):
    logger.info("starting 投标文件要求...")
    if not merged_baseinfo_path:
@ -216,22 +216,28 @@ def goods_bid_main(output_folder, file_path, file_type, unique_id):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # immediately start the tasks that do not depend on knowledge_name and index
        futures = {
-           'evaluation_standards': executor.submit(fetch_evaluation_standards, processed_data['invalid_deleted_docx'],
+           'evaluation_standards': executor.submit(fetch_evaluation_standards, processed_data['invalid_deleted_docx'],  # technical & business scoring
                                                    processed_data['evaluation_method_path'], logger),
-           'invalid_requirements': executor.submit(fetch_invalid_requirements, processed_data['invalid_added_docx'],
+           'invalid_requirements': executor.submit(fetch_invalid_requirements, processed_data['invalid_added_docx'],  # invalid-bid and void-bid items
                                                    output_folder, logger),
            'bidding_documents_requirements': executor.submit(fetch_bidding_documents_requirements, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],
-                                                             processed_data['clause_path'], logger),
-           'opening_bid': executor.submit(fetch_bid_opening, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'], processed_data['clause_path'], logger),
-           'base_info': executor.submit(fetch_project_basic_info, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],
+                                                             processed_data['clause_path'], logger),  # bid document requirements
+           'opening_bid': executor.submit(fetch_bid_opening, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],
+                                          processed_data['clause_path'], logger),  # bid opening/evaluation/award process
+           'base_info': executor.submit(fetch_project_basic_info, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],  # basic info
                                         processed_data['procurement_path'], processed_data['clause_path'], logger),
-           'qualification_review': executor.submit(fetch_qualification_review, processed_data['invalid_deleted_docx'],
+           'qualification_review': executor.submit(fetch_qualification_review, processed_data['invalid_deleted_docx'],  # qualification review
                                                    processed_data['qualification_path'],
                                                    processed_data['notice_path'], logger),
        }

        # process these independent tasks first, yielding in completion order
-       for future in concurrent.futures.as_completed(futures.values()):
+       for future in concurrent.futures.as_completed(futures.values()):  # as_completed: whichever finishes first is returned first
            key = next(k for k, v in futures.items() if v == future)
            try:
                result = future.result()
@ -244,8 +250,8 @@ def goods_bid_main(output_folder, file_path, file_type, unique_id):
                technical_standards = result["technical_standards"]
                commercial_standards = result["commercial_standards"]
                # yield the technical and business scoring separately
-               yield json.dumps({'technical_standards': transform_json_values(technical_standards)}, ensure_ascii=False)
-               yield json.dumps({'commercial_standards': transform_json_values(commercial_standards)}, ensure_ascii=False)
+               yield json.dumps({'technical_standards': transform_json_values(technical_standards)}, ensure_ascii=False)  # technical scoring
+               yield json.dumps({'commercial_standards': transform_json_values(commercial_standards)}, ensure_ascii=False)  # business scoring
            else:
                # handle the results of the other tasks
                yield json.dumps({key: transform_json_values(result)}, ensure_ascii=False)
@ -31,6 +31,7 @@ def create_app():

    @app.teardown_request
    def teardown_request(exception):
+       # runs after every request; does some cleanup work
        output_folder = getattr(g, 'output_folder', None)
        if output_folder:
            # perform cleanup related to output_folder (e.g. delete temporary files)
The same wording fix is applied in two prompt files: point 6 of the prompt is rephrased so that '信息公示媒介' unambiguously means the media on which clarifications, amendments, and evaluation results are published. The prompt text stays in Chinese, since it is sent to the model verbatim.

@ -29,7 +29,7 @@
    }
}

-6.请从提供的招标文件中提取与“信息公示媒介”相关的信息(如补充说明、文件澄清、评标结果等)公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
+6.请从提供的招标文件中提取与“信息公示媒介”相关的信息,即补充说明、文件澄清、评标结果等信息的公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
{
    "信息公示媒介":["招标公告在政府采购网(www.test.gov.cn)发布。","中标结果将在采购网(www.test.bid.cn)予以公告。"]
}

@ -29,7 +29,7 @@
    }
}

-6.请从提供的招标文件中提取与“信息公示媒介”相关的信息(如补充说明、文件澄清、评标结果等)公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
+6.请从提供的招标文件中提取与“信息公示媒介”相关的信息,即补充说明、文件澄清、评标结果等信息的公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
{
    "信息公示媒介":["招标公告在政府采购网(www.test.gov.cn)发布。","中标结果将在采购网(www.test.bid.cn)予以公告。"]
}
@ -188,8 +188,7 @@ def truncate_pdf_main_engineering(input_path, output_folder, selection, logger,
        # instructions to bidders
        path1, path2 = extract_pages_tobidders_notice(pdf_path, output_folder, begin_page, common_header, invalid_endpage)
        return [path1 or "", path2 or ""]
-   elif selection == 5:
-       # 无效标 (content before the 投标文件格式 section or the contract clauses)
+   elif selection == 5:  # everything before the 投标文件格式 section, or before the contract clauses
        invalid_path, end_page = get_invalid_file(pdf_path, output_folder, common_header, begin_page)
        return [invalid_path or "", end_page]
    else:
@ -40,11 +40,11 @@ def combine_basic_info(merged_baseinfo_path, procurement_path, clause_path, invali
    temp_list = []
    procurement_reqs = {}
    # a thread function that fetches the basic info
-   def get_base_info_thread():
+   def get_base_info_thread():  # the traditional basic-info extraction
        nonlocal temp_list
        temp_list = get_base_info(merged_baseinfo_path, clause_path, invalid_path)
    # a thread function that fetches the procurement requirements
-   def fetch_procurement_reqs_thread():
+   def fetch_procurement_reqs_thread():  # procurement-requirements extraction
        nonlocal procurement_reqs
        procurement_reqs = fetch_procurement_reqs(procurement_path, invalid_path)
    # create and start the basic-info thread
@ -216,10 +216,10 @@ def truncate_pdf_main_goods(input_path, output_folder, selection, logger, output_
        6: 0  # added default for selection 6 if needed
    }.get(selection, 0)
    # set the begin/end patterns according to the selection
-   if selection == 1:
+   if selection == 1:  # tender notice
        path = get_notice(pdf_path, output_folder, begin_page, common_header, invalid_endpage)
        return [path or ""]
-   elif selection == 2:
+   elif selection == 2:  # evaluation method
        begin_pattern = regex.compile(
            r'^第[一二三四五六七八九十]+(?:章|部分)\s*'
            r'(?<!"\s*)(?<!“\s*)(?<!”\s*)(?=.*(?:磋商(?=.*(?:办法|方法|内容))|'
@ -230,7 +230,7 @@ def truncate_pdf_main_goods(input_path, output_folder, selection, logger, output_
            r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+', regex.MULTILINE
        )
        local_output_suffix = "evaluation_method"
-   elif selection == 3:
+   elif selection == 3:  # qualification review
        begin_pattern = regex.compile(
            r'^第[一二三四五六七八九十百千]+(?:章|部分).*?(资格审查).*', regex.MULTILINE
        )
@ -238,10 +238,10 @@ def truncate_pdf_main_goods(input_path, output_folder, selection, logger, output_
            r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+', regex.MULTILINE
        )
        local_output_suffix = "qualification1"
-   elif selection == 4:
+   elif selection == 4:  # instructions to bidders: front table + body
        path1, path2 = extract_pages_tobidders_notice(pdf_path, output_folder, begin_page, common_header, invalid_endpage)
        return [path1 or "", path2 or ""]
-   elif selection == 5:
+   elif selection == 5:  # procurement requirements
        begin_pattern = regex.compile(
            r'^第[一二三四五六七八九十百千]+(?:章|部分).*?(?:服务|项目|商务|技术|供货).*?要求|'
            r'^第[一二三四五六七八九十百千]+(?:章|部分)(?!.*说明).*(?:采购.*?(?:内容|要求|需求)|(招标|项目)(?:内容|要求|需求)).*|'
@ -251,7 +251,7 @@ def truncate_pdf_main_goods(input_path, output_folder, selection, logger, output_
            r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+', regex.MULTILINE
        )
        local_output_suffix = "procurement"
-   elif selection == 6:
+   elif selection == 6:  # everything before the 投标文件格式 section, or before the contract clauses
        invalid_path, end_page = get_invalid_file(pdf_path, output_folder, common_header, begin_page)
        return [invalid_path or "", end_page]
    else:
@ -656,7 +656,7 @@ def get_technical_requirements(invalid_path, processed_filepath, model_type=1):
    else:
        # step 1: collect the questions and parsed results that need `continue_answer`
        questions_to_continue = []  # stores the (question, parsed) pairs that need continue_answer
-       max_tokens = 3900 if model_type == 1 else 5900
+       max_tokens = 8100 if model_type == 1 else 5900  # plus has max_tokens 8192 and qianwen-long 6000, so stay slightly below; if switching to doubao it is only 4000!!!
        for question, response in results:
            message = response[0]
            parsed = clean_json_string(message)
@ -674,7 +674,7 @@ def get_technical_requirements(invalid_path, processed_filepath, model_type=1):
    # update the original procurement-requirements dict
    final_res = combine_and_update_results(modified_data, temp_final)
    ffinal_res = main_postprocess(final_res)
-   ffinal_res["货物列表"] = good_list
+   ffinal_res["货物列表"] = good_list  # carries out the list of goods to be procured
    # output the final JSON string
    return {"采购需求": ffinal_res}
@ -12,8 +12,8 @@ def fetch_procurement_reqs(procurement_path, invalid_path):
    # procurement_path may be a pdf or a docx
    # the default procurement_reqs dict
    DEFAULT_PROCUREMENT_REQS = {
-       "采购需求": {},
-       "技术要求": [],
+       "采购需求": {},  # requirements and technical parameters for the specific goods
+       "技术要求": [],  # technical/business/service requirements on the supplier, not on specific goods
        "商务要求": [],
        "服务要求": [],
        "其他要求": []
@ -46,9 +46,9 @@ def fetch_procurement_reqs(procurement_path, invalid_path):
    # run get_technical_requirements and get_business_requirements in parallel with a ThreadPoolExecutor
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # submit the tasks to the pool
-       future_technical = executor.submit(get_technical_requirements, invalid_path, processed_filepath, tech_model_type)
+       future_technical = executor.submit(get_technical_requirements, invalid_path, processed_filepath, tech_model_type)  # procurement requirements
        time.sleep(0.5)  # keep the original delay
-       future_business = executor.submit(get_business_requirements, procurement_path, processed_filepath, busi_model_type)
+       future_business = executor.submit(get_business_requirements, procurement_path, processed_filepath, busi_model_type)  # technical/business/service/other requirements
        # fetch the results of the parallel tasks
        technical_requirements = future_technical.result()
        business_requirements = future_business.result()
Binary files added: md_files/0.png through md_files/14.png (15.png absent), plus md_files/16.png and md_files/17.png — the screenshots referenced in the README.