2.7 Adding comments

README.md

Project home: [zy123/zbparse - zbparse - 智标领航 code repository](http://47.98.59.178:3000/zy123/zbparse)

git clone URL: http://47.98.59.178:3000/zy123/zbparse.git

Work on the develop branch; among the develop-xx branches, a later xx means a newer branch.

Production environment: 121.41.119.164:5000

Test environment: 47.98.58.178:5000

"Full parse" (大解析): the flow entered from the tender-file parsing entry point; handled by upload.py.

"Small parse" (小解析): the flow entered from the bid-file generation entry point; the backend calls the two endpoints little_zbparse and get_deviation together.

## Project structure


.env holds secrets (LLM keys, textin keys, etc.). It is ignored via .gitignore, so a `git pull` on the server never touches it (the keys are sensitive); the .env in the server's project directory must be maintained by hand.

**How to update the version on the server:**

1. Enter the project directory.

   

   **Note:** confirm that .env exists on the server; it is hidden by default.

   Run `cat .env` to check.

   If it does not exist, run `sudo vim .env` in the project directory and paste the secrets in!

2. `git pull`

3. `sudo docker-compose up --build -d` rebuilds the image and restarts in one step.

   Alternatively, run `sudo docker-compose build` first to build the image, then `sudo docker-compose up -d` to restart when the service is idle.

4. `sudo docker-compose logs flask_app --since 1h` shows the last hour of logs (it also works when the restart failed, so running it after every restart is recommended).

requirements.txt rarely needs changes; only when the code starts using a new library must its package name and version be added there by hand.
**How to run the project locally:**

1. Install the environment from requirements.txt.
2. Set up .env (normally no extra OS-level environment variables are needed).
3. Open the run-configuration dropdown, choose Edit configurations,

   

   and set run_serve.py as the startup script.

4. Test with a POST request from Postman:

   http://127.0.0.1:5000/upload

   body:

   {
       "file_url": "xxxx",
       "zb_type": 2
   }
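As a sketch, the Postman request above can be reproduced in Python. The URL and body fields come from the README; the helper names are mine, and the actual call requires the service to be running locally:

```python
import json

# Endpoint from the README; zb_type=2 selects goods-tender parsing.
UPLOAD_URL = "http://127.0.0.1:5000/upload"

def build_upload_payload(file_url: str, zb_type: int = 2) -> dict:
    """Builds the JSON body expected by /upload."""
    return {"file_url": file_url, "zb_type": zb_type}

def post_upload(file_url: str, zb_type: int = 2):
    """Sends the request; the endpoint streams its results back piece by piece."""
    import requests  # third-party; only needed for the actual call
    return requests.post(UPLOAD_URL,
                         json=build_upload_payload(file_url, zb_type),
                         stream=True)

if __name__ == "__main__":
    print(json.dumps(build_upload_payload("https://example.com/tender.pdf"),
                     ensure_ascii=False))
```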
## flask_app structure

### general

The folder for shared helper functions. llm/ contains the various LLM clients; 读取文件/ contains the docx/pdf readers plus the document cleaner clean_pdf, which strips headers, footers, and page numbers.



general/llm/清除file_id.py must be run **at least once a week** to keep the file_id count from exceeding the quota (on my side every request cleans up its file_ids when it finishes; 向's side probably has not added this yet).

llm/model_continue_query is the "model, please continue" script: when a very long answer cannot be produced in one model call, it keeps asking the model to continue and stitches the pieces into a complete answer.
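The continue-and-stitch idea can be sketched as follows. This is a minimal illustration, not the project's actual script: `ask` stands in for the real model call, and the finish-reason convention is an assumption:

```python
def query_until_complete(ask, prompt, max_rounds=5):
    """Keeps asking the model to continue while it stops because of the
    output-token limit, then concatenates all pieces.

    `ask(prompt)` must return (text, finish_reason), where finish_reason
    is "length" when the model was cut off mid-answer.
    """
    text, reason = ask(prompt)
    parts = [text]
    rounds = 0
    while reason == "length" and rounds < max_rounds:
        # The real script presumably re-sends conversation context;
        # here we only send a bare continue instruction.
        text, reason = ask("请接着上文继续输出,不要重复已有内容。")
        parts.append(text)
        rounds += 1
    return "".join(parts)
```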
general/file2markdown converts files to markdown via textin.

general/format_change converts pdf -> docx, and doc/docx -> pdf.

general/merge_pdfs.py concatenates files: 1. tender notice + instructions to bidders; 2. evaluation-rules chapter + qualification-review chapter.

**Important parts of general!!!**

**Post-processing:**

general/**post_processing** is the post-processing stage after parsing; extract_info, the qualification review, the technical deviations, the business deviations, and the required supporting documents are all generated here.

**inner_post_processing** inside post_processing extracts *extracted_info* specifically.

**process_functions_in_parallel** inside post_processing extracts:

the qualification review, technical deviations, business deviations, and required supporting documents.



The full parse (upload) uses the complete post_processing;

little_zbparse.py and 小解析main.py use inner_post_processing;

get_deviation.py and 偏离表数据解析main.py use process_functions_in_parallel.
**PDF truncation:**

*截取pdf_main.py* is the top-level entry;

one level down are *截取pdf货物标版.py* and *截取pdf工程标版.py* (not under general);

one level further down is *截取pdf通用函数.py*.

**Shared invalid-bid / void-bid code (无效标和废标公共代码)**

The main executable code for extracting invalid-bid and void-bid clauses. The pipeline: preprocess the docx file => regex matching => temp.txt => LLM filtering.

If the extraction is incomplete, either the regexes do not cover the wording, or the LLM prompt missed an item.

Note: if a paragraph matches both the regexes and any one of the follow_up_keywords, it is not added to temp (so it skips LLM filtering); instead it is **added directly** to the final result!
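The routing rule just described can be sketched like this. The pattern and keyword lists here are stand-ins, not the project's real regexes:

```python
import re

# Assumed patterns for illustration only; the real lists are larger.
INVALID_PATTERNS = [re.compile(r"废标"), re.compile(r"无效")]
FOLLOW_UP_KEYWORDS = [re.compile(r"有下列情形之一")]

def route_paragraphs(paragraphs):
    """Splits regex-matched paragraphs into direct results vs. LLM candidates."""
    direct, for_llm = [], []
    for p in paragraphs:
        if not any(rx.search(p) for rx in INVALID_PATTERNS):
            continue  # no regex hit: the paragraph is dropped entirely
        if any(kw.search(p) for kw in FOLLOW_UP_KEYWORDS):
            direct.append(p)   # goes straight into the final result
        else:
            for_llm.append(p)  # written to temp.txt for LLM filtering
    return direct, for_llm
```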


**Extracting the instructions-to-bidders body clauses into a JSON file**

Converts the truncated ztbfile_tobidders_notice_part2.pdf (the instructions body) into clause1.json, which later feeds the extraction of the **bid opening/evaluation/award process**, the **bid document requirements**, and **re-tendering, no further tendering, and termination of tendering**.

The core logic first matches top-level chapters of the form '一、总则',

then numbered headings like '1.1' and '1.1.1'. Since the pdf is read line by line, the content after one number may span several lines, so everything up to the line starting the next number (e.g. '2.1') is attributed to the previous number.
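A simplified sketch of that accumulation logic (the real parser handles many more edge cases, e.g. guarding against '2025年xxx' being read as a heading number):

```python
import re

NUM_RE = re.compile(r"^(\d+(?:\.\d+)*)\s*(.*)")

def parse_clauses(lines):
    """Groups lines under the most recent numbered heading."""
    clauses, current_key, buf = {}, None, []
    for line in lines:
        m = NUM_RE.match(line.strip())
        if m:  # a new numbered heading closes the previous clause
            if current_key is not None:
                clauses[current_key] = " ".join(buf).strip()
            current_key, buf = m.group(1), [m.group(2)]
        elif current_key is not None:
            buf.append(line.strip())  # continuation line of the current clause
    if current_key is not None:
        clauses[current_key] = " ".join(buf).strip()
    return clauses
```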
### old_version

Deprecated code that is not used in the production or test environments; it can be ignored.



### routes

The endpoints and their main implementations, one to one:



get_deviation corresponds to 偏离表数据解析main and returns the deviation-table data;

judge_zbfile corresponds to deciding whether a file is a tender document;

little_zbparse corresponds to 小解析main and produces extract_info;

test_zbparse is a test endpoint with no counterpart;

upload corresponds to engineering-tender and goods-tender parsing, i.e. the full parse.

**Disambiguation**: "small parse" can mean a process, namely the parse entered from the '投标文件生成' (bid-file generation) entry, where the backend calls little_zbparse and get_deviation together; that whole process is called the small parse. But little_zbparse by itself is also called the small parse: it got the name because originally only that data (extract_info) had to be returned, and the business and technical deviations were added later.

utils holds the shared helpers for the routes. Its validate_and_setup_logger function maps each endpoint request to its own output folder, e.g. upload -> output1; when new endpoints are added, their mapping can simply be written here as well.



For the full parse, focus on **upload.py** and **货物标解析main.py**.
### static

Holds the parsing output and the prompts.

output is gitignored, so `git push` does not upload it.

Each subfolder (output1, output2, ...) corresponds to a different endpoint request.



### test_case & testdir

test_case contains test cases for individual functions; it has not been updated in a long while.

testdir is the scratch space for everyday testing while coding.

Neither affects parsing in the production or test environments.


### 工程标 & 货物标

These folders hold what differs between the two parsing flows (everything shared lives in **general**).



The main difference is that the goods-tender flow additionally parses the procurement requirements (提取采购需求main + 技术参数要求提取 + 商务服务其他要求提取).

### Finally:

ConnectionLimiter.py defines the per-endpoint timeout; after the timeout, the connection to the backend is closed.
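A decorator in this spirit might look like the following. This is purely illustrative, not ConnectionLimiter's actual implementation (which must also close the streaming connection):

```python
import concurrent.futures
import functools

def require_timeout(seconds):
    """Runs the wrapped function in a worker thread and gives up after `seconds`."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            ex = concurrent.futures.ThreadPoolExecutor(max_workers=1)
            fut = ex.submit(fn, *args, **kwargs)
            try:
                return fut.result(timeout=seconds)
            except concurrent.futures.TimeoutError:
                # A thread cannot be killed; the real code presumably
                # drops the connection and lets the worker finish quietly.
                return {"status": "error", "message": "timeout"}
            finally:
                ex.shutdown(wait=False)
        return wrapper
    return deco
```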


logger_setup.py creates a separate log per request; each log corresponds to one log.txt.

start_up.py is the startup script; run_serve is also a startup script, a thin wrapper around start_up.py, and the Dockerfile currently launches via run_serve directly.

## Ongoing concerns

```
yield sse_format(tech_deviation_response)
yield sse_format(tech_deviation_star_response)
yield sse_format(zigefuhe_deviation_response)
yield sse_format(shangwu_deviation_response)
yield sse_format(shangwu_star_deviation_response)
yield sse_format(proof_materials_response)
```
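sse_format, used in the block above, is presumably a small helper that wraps each response as a Server-Sent-Events frame. A minimal sketch of the idea (not the project's actual implementation):

```python
import json

def sse_format(response: dict) -> str:
    """Wraps a response dict as one SSE frame: a 'data:' line plus a blank line."""
    return f"data: {json.dumps(response, ensure_ascii=False)}\n\n"
```

Each yielded frame then reaches the backend (and, from there, the frontend) as one event in the stream.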
1. Engineering-tender parsing still does not parse the procurement requirements, so its post-processing only returns '资格审查', '证明材料', and 'extracted_info'; it has neither '商务偏离'/'商务带星偏离' nor '技术偏离'/'技术带星偏离'. Goods-tender parsing is the complete version.

   Of these, '证明材料' and 'extracted_info' are returned directly to the backend for storage.

2. The full parse returns the technical score; the backend not only shows it to the frontend but also passes it to 向 for generating the technical deviation table.

3. In the small parse, get_deviation.py could also return the technical score, but it currently does not: nobody was coordinating with me on it, so it is commented out for now.

   

4. The business-review and technical-review deviation tables (the deviation tables for the scoring rules) have not been built yet, but **商务评分 and 技术评分** are parsed in both the full and the small parse; lightly reshaping that data and returning it to the backend would be enough.

   

   This parsed result is fine for frontend display, but generating the business/technical review deviation tables would require one more LLM call to re-summarize the data, ideally into a list of strings, before sending it to the backend. (Not done yet.)
@ -266,7 +266,8 @@ def outer_post_processing(combined_data, includes, good_list):
    if "基础信息" in includes:
        base_info = combined_data.get("基础信息", {})
        # call the inner inner_post_processing on '基础信息'
-       extracted_info = inner_post_processing(base_info)
+       extracted_info = inner_post_processing(base_info)  # builds extract_info, which is returned to the backend

        # keep '基础信息' in the processed data
        processed_data["基础信息"] = base_info
    # extract '采购需求' under '采购要求'
@ -291,7 +292,7 @@ def outer_post_processing(combined_data, includes, good_list):
    busi_eval = combined_data.get("商务评分", {})
    busi_eval_info = json.dumps(busi_eval, ensure_ascii=False, indent=4)
    all_data_info = '\n'.join([zige_info, fuhe_info, zigefuhe_info, tech_deviation_info, busi_requirements_info, tech_eval_info, busi_eval_info])
-   tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation, proof_materials = process_functions_in_parallel(
+   tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation, proof_materials = process_functions_in_parallel(  # the main function generating the technical/business deviations
        tech_requirements_dict=tech_requirements,
        busi_requirements_dict=busi_requirements,
        zige_info=zige_info,
@ -387,8 +387,8 @@ def extract_from_notice(merged_baseinfo_path, clause_path, type):

    # map type to target_values
    type_target_map = {
-       1: ["投标", "投标文件", "响应文件"],
-       2: ["开标", "评标", "定标", "评审", "成交", "合同", "磋商", "谈判", "中标", "程序", "步骤"],
+       1: ["投标", "投标文件", "响应文件"],  # bid document requirements
+       2: ["开标", "评标", "定标", "评审", "成交", "合同", "磋商", "谈判", "中标", "程序", "步骤"],  # bid opening/evaluation/award process
        3: ["重新招标、不再招标和终止招标", "重新招标", "重新采购", "不再招标", "不再采购", "终止招标", "终止采购"],
        4: ["评标"]  # test
    }
@ -10,11 +10,10 @@ from flask_app.general.截取pdf通用函数 import clean_page_content, extract_
def compare_headings(current, new):
    """
-   比较两个标题的层次关系,并确保新标题比当前标题大且最高位数字差值不超过5。
+   Compares the hierarchy of two headings, ensuring the new heading is larger than the current one and the difference in the leading number stays within a small bound.
+   Rationale: a new heading number is expected to be larger than the previous one, and the bound also guards against very large numbers at the start of a line, such as '2025年xxx', where '2025' would otherwise be mis-matched as a heading number even though it is only body text.

    Args:
        current (str): the current heading, e.g. "1.2.3"
        new (str): the new heading, e.g. "1.3"

    Returns:
        bool: True if the new heading is larger than the current one and the leading-number difference does not exceed 3, otherwise False
    """
@ -80,7 +79,7 @@ def parse_text_by_heading(text):
    lines = text.split('\n')

    def get_current_number(key_chinese):
-       chinese_to_number = {
+       chinese_to_number = {  # makes Chinese ordinals comparable by size
            '一': 1, '二': 2, '三': 3, '四': 4, '五': 5,
            '六': 6, '七': 7, '八': 8, '九': 9, '十': 10,
            '十一': 11, '十二': 12, '十三': 13, '十四': 14, '十五': 15
@ -99,7 +98,7 @@ def parse_text_by_heading(text):
    patterns = [
        r'^(?<![a-zA-Z((])(\d+(?:\.\d+)+)\s*(.*)',  # matches '12.1 content'
        r'^(\d+\.)\s*(.+)$',  # matches '12. content'
-       r'^[..](\d+(?:[..]\d+)*)\s*(.+)$',  # matches '.12.1 content'
+       r'^[..](\d+(?:[..]\d+)*)\s*(.+)$',  # matches '.12.1 content'; the leading dot appears because clean_page_content, run while reading the pdf, strips headers/footers/page numbers and can mistake part of a heading number for a page number and delete it; this case is already handled reasonably well in the code
        r'^(\d+)([^.\d].*)'  # matches '27 content'
    ]
    for pattern in patterns:
@ -111,9 +110,9 @@ def parse_text_by_heading(text):
    first_five_lines = lines[:5]
    has_initial_heading_patterns = False
    for line in first_five_lines:
-       line_stripped = line.strip().replace('.', '.')
-       if line_stripped.startswith("##"):
-           line_stripped = line_stripped[2:]  # Remove "##"
+       line_stripped = line.strip().replace('.', '.')  # line_stripped is the line currently being processed
+       if line_stripped.startswith("##"):  # the '##' marker is added during preprocessing before the first line of every pdf page so it can be treated specially: the first line's heading number is often wrongly deleted by clean_page_content and must be restored
+           line_stripped = line_stripped[2:]  # remove "##"
        if (pattern_numbered.match(line_stripped) or pattern_parentheses.match(
                line_stripped) or pattern_letter_initial.match(line_stripped)):
            has_initial_heading_patterns = True
@ -135,14 +134,14 @@ def parse_text_by_heading(text):
    if not match:
        match = re.match(r'^(\d+\.)\s*(.+)$', line_stripped)

-   # check whether we enter or leave a special section
+   # Check whether we enter or leave a special section. This prevents lines such as '一、招标公告', '二、投标人须知', '七、 xxxx' that appear a few lines after e.g. '5、 竞争性磋商采购文件的构成' from being wrongly matched as top-level headings; they should be treated as body text under '5、 竞争性磋商采购文件的构成'.
+   # While inside a special section, every processed line is treated as content of the current number rather than as a new heading.
    if is_heading(line_stripped):
        if any(re.search(pattern, line_stripped) for pattern in special_section_keywords):
            in_special_section = True
        elif in_special_section:
            in_special_section = False

    # the original matching logic follows
    # match a line starting with a dot followed by numbers, e.g. '.12.1 content'
    dot_match = re.match(r'^[..](\d+(?:[..]\d+)*)\s*(.+)$', line_stripped)

@ -283,8 +282,9 @@ def parse_text_by_heading(text):
                                             in_special_section)

    else:
-       # decide from the pre-set flag whether to run this part
+       # Nothing above matched, so the current line is either a top-level heading like '一、总则' or plain body text such as '应通知采购代理机构补全或更换,否则风险自负。'
        if has_initial_heading_patterns and not skip_subheadings and not in_special_section:
+           # Matched one of the top-level heading forms ('一、xx', '(一)、xx', 'A.xx') and not inside a special section: continue with the code below.
            numbered_match = pattern_numbered.match(line_stripped)  # 一、
            parentheses_match = pattern_parentheses.match(line_stripped)  # (一)
            if i < 5:
@ -368,7 +368,7 @@ def parse_text_by_heading(text):
        append_newline = handle_content_append(current_content, line_stripped, append_newline, keywords,
                                               in_special_section)
    else:
-       # inside a special section, everything becomes content of the current heading
+       # in this case everything is treated as content of the current heading and appended to current_content
        if line_stripped:
            append_newline = handle_content_append(current_content, line_stripped, append_newline, keywords,
                                                   in_special_section)
@ -74,8 +74,11 @@ def clean_dict_datas(extracted_contents, keywords, excludes):  # 让正则表达

    return all_text1, all_text2  # all_text1 additionally goes through GPT; all_text2 is returned directly

+# handles paragraphs that span a page break

def preprocess_paragraphs(elements):
+   '''
+   Handles paragraphs that span a page break: program logic decides whether two paragraphs can be merged into one.
+   '''
    processed = []  # the list of processed paragraphs
    index = 0
    flag = False  # the flag bit
@ -437,8 +440,10 @@ def split_cell_text(text):
    # print(split_sentences)
    return split_sentences

-# file preprocessing: extract text and tables in file order and merge cross-page tables
def extract_file_elements(file_path):
+   '''
+   File preprocessing: extracts text and tables in file order, merging tables that span pages.
+   '''
    doc = Document(file_path)
    doc_elements = doc.element.body
    doc_paragraphs = doc.paragraphs
@ -71,4 +71,4 @@ def create_logger(app, subfolder):
    logger.setLevel(logging.INFO)
    logger.propagate = False
    g.logger = logger
-   g.output_folder = output_folder
+   g.output_folder = output_folder  # path of the output folder
@ -11,7 +11,7 @@ from flask_app.工程标.无效标和废标和禁止投标整合 import combine_
from flask_app.工程标.投标人须知正文提取指定内容工程标 import extract_from_notice
import concurrent.futures
from flask_app.工程标.基础信息整合工程标 import combine_basic_info
-from flask_app.工程标.资格审查模块 import combine_review_standards
+from flask_app.工程标.资格审查模块main import combine_review_standards
from flask_app.old_version.商务评分技术评分整合old_version import combine_evaluation_standards
from flask_app.general.format_change import pdf2docx, docx2pdf, doc2docx
from flask_app.general.docx截取docx import copy_docx
@ -13,7 +13,7 @@ get_deviation_bp = Blueprint('get_deviation', __name__)
@get_deviation_bp.route('/get_deviation', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=720)
-def get_deviation():
+def get_deviation():  # serves the business/technical deviation data
    logger = g.logger
    unique_id = g.unique_id
    file_url = g.file_url

@ -16,7 +16,7 @@ class JudgeResult(Enum):
@judge_zbfile_bp.route('/judge_zbfile', methods=['POST'])
@validate_and_setup_logger
# @require_connection_limit(timeout=30)
-def judge_zbfile() -> Any:
+def judge_zbfile() -> Any:  # decides whether the file is a tender document
    """
    Main function; calls wrapper and sets the timeout for the whole endpoint. Returns a default value on timeout.
    """

@ -13,7 +13,7 @@ little_zbparse_bp = Blueprint('little_zbparse', __name__)
@little_zbparse_bp.route('/little_zbparse', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=300)
-def little_zbparse():
+def little_zbparse():  # small parse
    logger = g.logger
    file_url = g.file_url
    zb_type = g.zb_type

@ -15,7 +15,7 @@ upload_bp = Blueprint('upload', __name__)
@upload_bp.route('/upload', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=720)
-def zbparse():
+def zbparse():  # full parse
    logger = g.logger
    try:
        logger.info("大解析开始!!!")
@ -25,7 +25,7 @@ def zbparse():
    zb_type = g.zb_type
    try:
        logger.info("starting parsing url:" + file_url)
-       return process_and_stream(file_url, zb_type)
+       return process_and_stream(file_url, zb_type)  # the main worker function
    except Exception as e:
        logger.error('Exception occurred: ' + str(e))
        if hasattr(g, 'unique_id'):
@ -89,12 +89,12 @@ def process_and_stream(file_url, zb_type):
    good_list = None

    processing_functions = {
-       1: engineering_bid_main,
-       2: goods_bid_main
+       1: engineering_bid_main,  # engineering-tender parsing
+       2: goods_bid_main  # goods-tender / service-tender parsing
    }
    processing_func = processing_functions.get(zb_type, goods_bid_main)

-   for data in processing_func(output_folder, downloaded_filepath, file_type, unique_id):
+   for data in processing_func(output_folder, downloaded_filepath, file_type, unique_id):  # receives the goods/engineering parsing results one by one, for frontend display
        if not data.strip():
            logger.error("Received empty data, skipping JSON parsing.")
            continue
@ -117,7 +117,7 @@ def process_and_stream(file_url, zb_type):
            yield sse_format(error_response)
            return  # stop further processing

-       if 'good_list' in parsed_data:
+       if 'good_list' in parsed_data:  # the goods list
            good_list = parsed_data['good_list']
            logger.info("Collected good_list from the processing function: %s", good_list)
            continue
@ -131,20 +131,22 @@ def process_and_stream(file_url, zb_type):
            status='success',
            data=data
        )
-       yield sse_format(response)
+       yield sse_format(response)  # sent to the backend -> shown on the frontend

    base_end_time = time.time()
    logger.info(f"分段解析完成,耗时:{base_end_time - start_time:.2f} 秒")

+   # At this point the frontend has received all parsed content; everything below is unrelated to frontend display and is post-processing: 1. extracted_result, key-information storage 2. the technical deviation table 3. the business deviation table 4. the supporting documents the bidder must submit (currently stored by the backend, not yet shown on the frontend)
+   # post-processing starts here!!!
    output_json_path = os.path.join(output_folder, 'final_result.json')
    extracted_info_path = os.path.join(output_folder, 'extracted_result.json')
    includes = ["基础信息", "资格审查", "商务评分", "技术评分", "无效标与废标项", "投标文件要求", "开评定标流程"]
    final_result, extracted_info, tech_deviation, tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation, proof_materials = outer_post_processing(
-       combined_data, includes, good_list)
+       combined_data, includes, good_list)  # post-processing: builds extracted_info, the business/technical deviation data, and the supporting materials returned to the backend

+   # post-processing done! The rest only builds and returns responses without further changing the data.
    tech_deviation_response, tech_deviation_star_response, zigefuhe_deviation_response, shangwu_deviation_response, shangwu_star_deviation_response, proof_materials_response = generate_deviation_response(
        tech_deviation, tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation,
-       proof_materials, logger)
+       proof_materials, logger)  # builds the standardized responses

    # use the generic response helper
    yield sse_format(tech_deviation_response)
@ -184,7 +186,7 @@ def process_and_stream(file_url, zb_type):
    )
    yield sse_format(complete_response)

-   final_response = create_response(
+   final_response = create_response(  # the backend currently terminates the connection when it reads 'END' in 'data'
        message='文件上传并处理成功',
        status='success',
        data='END'
@ -16,7 +16,7 @@ from flask_app.general.无效标和废标公共代码 import combine_find_invali
from flask_app.general.投标人须知正文提取指定内容 import extract_from_notice
import concurrent.futures
from flask_app.工程标.基础信息整合工程标 import combine_basic_info
-from flask_app.工程标.资格审查模块 import combine_review_standards
+from flask_app.工程标.资格审查模块main import combine_review_standards
from flask_app.general.商务技术评分提取 import combine_evaluation_standards
from flask_app.general.format_change import pdf2docx, docx2pdf

@ -34,9 +34,8 @@ def preprocess_files(output_folder, file_path, file_type, logger):

    # call the PDF truncation several times
    truncate_files = truncate_pdf_multiple(pdf_path, output_folder, logger, 'goods')  # index: 0->商务技术服务要求 1->评标办法 2->资格审查 3->投标人须知前附表 4->投标人须知正文

-   # handle the individual parts
-   invalid_path = truncate_files[6] if truncate_files[6] != "" else pdf_path  # 无效标
+   invalid_path = truncate_files[6] if truncate_files[6] != "" else pdf_path  # 无效标 (the content before the 投标文件格式 section / the contract clauses)

    invalid_added_pdf = insert_mark(invalid_path)
    invalid_added_docx = pdf2docx(invalid_added_pdf)  # the marked invalid_path
@ -141,6 +140,7 @@ def fetch_invalid_requirements(invalid_added_docx, output_folder, logger):
        result = {"无效标与废标": {}}
    return result

+# bid document requirements
def fetch_bidding_documents_requirements(invalid_deleted_docx, merged_baseinfo_path, clause_path, logger):
    logger.info("starting 投标文件要求...")
    if not merged_baseinfo_path:
@ -216,22 +216,28 @@ def goods_bid_main(output_folder, file_path, file_type, unique_id):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # immediately start the tasks that do not depend on knowledge_name and index
        futures = {
-           'evaluation_standards': executor.submit(fetch_evaluation_standards, processed_data['invalid_deleted_docx'],
+           'evaluation_standards': executor.submit(fetch_evaluation_standards, processed_data['invalid_deleted_docx'],  # technical & business scoring
                                                    processed_data['evaluation_method_path'], logger),
-           'invalid_requirements': executor.submit(fetch_invalid_requirements, processed_data['invalid_added_docx'],
+           'invalid_requirements': executor.submit(fetch_invalid_requirements, processed_data['invalid_added_docx'],  # invalid-bid and void-bid items
                                                    output_folder, logger),
            'bidding_documents_requirements': executor.submit(fetch_bidding_documents_requirements, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],
-                                                             processed_data['clause_path'], logger),
-           'opening_bid': executor.submit(fetch_bid_opening, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'], processed_data['clause_path'], logger),
-           'base_info': executor.submit(fetch_project_basic_info, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],
+                                                             processed_data['clause_path'], logger),  # bid document requirements
+           'opening_bid': executor.submit(fetch_bid_opening, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],
+                                          processed_data['clause_path'], logger),  # bid opening/evaluation/award process
+           'base_info': executor.submit(fetch_project_basic_info, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],  # basic info
                                         processed_data['procurement_path'], processed_data['clause_path'], logger),
-           'qualification_review': executor.submit(fetch_qualification_review, processed_data['invalid_deleted_docx'],
+           'qualification_review': executor.submit(fetch_qualification_review, processed_data['invalid_deleted_docx'],  # qualification review
                                                    processed_data['qualification_path'],
                                                    processed_data['notice_path'], logger),
        }

        # process these independent tasks first, yielding in completion order
-       for future in concurrent.futures.as_completed(futures.values()):
+       for future in concurrent.futures.as_completed(futures.values()):  # as_completed: whichever finishes first is returned first
            key = next(k for k, v in futures.items() if v == future)
            try:
                result = future.result()
@ -244,8 +250,8 @@ def goods_bid_main(output_folder, file_path, file_type, unique_id):
                technical_standards = result["technical_standards"]
                commercial_standards = result["commercial_standards"]
                # yield the technical and business scoring separately
-               yield json.dumps({'technical_standards': transform_json_values(technical_standards)}, ensure_ascii=False)
-               yield json.dumps({'commercial_standards': transform_json_values(commercial_standards)}, ensure_ascii=False)
+               yield json.dumps({'technical_standards': transform_json_values(technical_standards)}, ensure_ascii=False)  # technical scoring
+               yield json.dumps({'commercial_standards': transform_json_values(commercial_standards)}, ensure_ascii=False)  # business scoring
            else:
                # handle the results of the other tasks
                yield json.dumps({key: transform_json_values(result)}, ensure_ascii=False)
@ -31,6 +31,7 @@ def create_app():

    @app.teardown_request
    def teardown_request(exception):
+       # runs after every request; does some cleanup work
        output_folder = getattr(g, 'output_folder', None)
        if output_folder:
            # perform cleanup related to output_folder (e.g. delete temporary files)
The same wording fix is applied in two prompt files: point 6 of the prompt is rephrased so that '信息公示媒介' unambiguously means the media on which clarifications, amendments, and evaluation results are published. The prompt text stays in Chinese, since it is sent to the model verbatim.

@ -29,7 +29,7 @@
    }
}

-6.请从提供的招标文件中提取与“信息公示媒介”相关的信息(如补充说明、文件澄清、评标结果等)公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
+6.请从提供的招标文件中提取与“信息公示媒介”相关的信息,即补充说明、文件澄清、评标结果等信息的公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
{
    "信息公示媒介":["招标公告在政府采购网(www.test.gov.cn)发布。","中标结果将在采购网(www.test.bid.cn)予以公告。"]
}

@ -29,7 +29,7 @@
    }
}

-6.请从提供的招标文件中提取与“信息公示媒介”相关的信息(如补充说明、文件澄清、评标结果等)公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
+6.请从提供的招标文件中提取与“信息公示媒介”相关的信息,即补充说明、文件澄清、评标结果等信息的公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
{
    "信息公示媒介":["招标公告在政府采购网(www.test.gov.cn)发布。","中标结果将在采购网(www.test.bid.cn)予以公告。"]
}
@ -188,8 +188,7 @@ def truncate_pdf_main_engineering(input_path, output_folder, selection, logger,
        # instructions to bidders
        path1, path2 = extract_pages_tobidders_notice(pdf_path, output_folder, begin_page, common_header, invalid_endpage)
        return [path1 or "", path2 or ""]
-   elif selection == 5:
-       # 无效标 (content before the 投标文件格式 section or the contract clauses)
+   elif selection == 5:  # everything before the 投标文件格式 section, or before the contract clauses
        invalid_path, end_page = get_invalid_file(pdf_path, output_folder, common_header, begin_page)
        return [invalid_path or "", end_page]
    else:
@ -40,11 +40,11 @@ def combine_basic_info(merged_baseinfo_path, procurement_path, clause_path, invali
    temp_list = []
    procurement_reqs = {}
    # a thread function that fetches the basic info
-   def get_base_info_thread():
+   def get_base_info_thread():  # the traditional basic-info extraction
        nonlocal temp_list
        temp_list = get_base_info(merged_baseinfo_path, clause_path, invalid_path)
    # a thread function that fetches the procurement requirements
-   def fetch_procurement_reqs_thread():
+   def fetch_procurement_reqs_thread():  # procurement-requirements extraction
        nonlocal procurement_reqs
        procurement_reqs = fetch_procurement_reqs(procurement_path, invalid_path)
    # create and start the basic-info thread
@ -216,10 +216,10 @@ def truncate_pdf_main_goods(input_path, output_folder, selection, logger, output_
        6: 0  # added default for selection 6 if needed
    }.get(selection, 0)
    # set the begin/end patterns according to the selection
-   if selection == 1:
+   if selection == 1:  # tender notice
        path = get_notice(pdf_path, output_folder, begin_page, common_header, invalid_endpage)
        return [path or ""]
-   elif selection == 2:
+   elif selection == 2:  # evaluation method
        begin_pattern = regex.compile(
            r'^第[一二三四五六七八九十]+(?:章|部分)\s*'
            r'(?<!"\s*)(?<!“\s*)(?<!”\s*)(?=.*(?:磋商(?=.*(?:办法|方法|内容))|'
@ -230,7 +230,7 @@ def truncate_pdf_main_goods(input_path, output_folder, selection, logger, output_
            r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+', regex.MULTILINE
        )
        local_output_suffix = "evaluation_method"
-   elif selection == 3:
+   elif selection == 3:  # qualification review
        begin_pattern = regex.compile(
            r'^第[一二三四五六七八九十百千]+(?:章|部分).*?(资格审查).*', regex.MULTILINE
        )
@ -238,10 +238,10 @@ def truncate_pdf_main_goods(input_path, output_folder, selection, logger, output_
            r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+', regex.MULTILINE
        )
        local_output_suffix = "qualification1"
-   elif selection == 4:
+   elif selection == 4:  # instructions to bidders: front table + body
        path1, path2 = extract_pages_tobidders_notice(pdf_path, output_folder, begin_page, common_header, invalid_endpage)
        return [path1 or "", path2 or ""]
-   elif selection == 5:
+   elif selection == 5:  # procurement requirements
        begin_pattern = regex.compile(
            r'^第[一二三四五六七八九十百千]+(?:章|部分).*?(?:服务|项目|商务|技术|供货).*?要求|'
            r'^第[一二三四五六七八九十百千]+(?:章|部分)(?!.*说明).*(?:采购.*?(?:内容|要求|需求)|(招标|项目)(?:内容|要求|需求)).*|'
@ -251,7 +251,7 @@ def truncate_pdf_main_goods(input_path, output_folder, selection, logger, output_
            r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+', regex.MULTILINE
        )
        local_output_suffix = "procurement"
-   elif selection == 6:
+   elif selection == 6:  # everything before the 投标文件格式 section, or before the contract clauses
        invalid_path, end_page = get_invalid_file(pdf_path, output_folder, common_header, begin_page)
        return [invalid_path or "", end_page]
    else:
@ -656,7 +656,7 @@ def get_technical_requirements(invalid_path, processed_filepath, model_type=1):
    else:
        # step 1: collect the questions and parsed results that need `continue_answer`
        questions_to_continue = []  # stores the (question, parsed) pairs that need continue_answer
-       max_tokens = 3900 if model_type == 1 else 5900
+       max_tokens = 8100 if model_type == 1 else 5900  # plus has max_tokens 8192 and qianwen-long 6000, so stay slightly below; if switching to doubao it is only 4000!!!
        for question, response in results:
            message = response[0]
            parsed = clean_json_string(message)
@ -674,7 +674,7 @@ def get_technical_requirements(invalid_path, processed_filepath, model_type=1):
    # update the original procurement-requirements dict
    final_res = combine_and_update_results(modified_data, temp_final)
    ffinal_res = main_postprocess(final_res)
-   ffinal_res["货物列表"] = good_list
+   ffinal_res["货物列表"] = good_list  # carries out the list of goods to be procured
    # output the final JSON string
    return {"采购需求": ffinal_res}
@ -12,8 +12,8 @@ def fetch_procurement_reqs(procurement_path, invalid_path):
    # procurement_path may be a pdf or a docx
    # the default procurement_reqs dict
    DEFAULT_PROCUREMENT_REQS = {
-       "采购需求": {},
-       "技术要求": [],
+       "采购需求": {},  # requirements and technical parameters for the specific goods
+       "技术要求": [],  # technical/business/service requirements on the supplier, not on specific goods
        "商务要求": [],
        "服务要求": [],
        "其他要求": []
@ -46,9 +46,9 @@ def fetch_procurement_reqs(procurement_path, invalid_path):
    # run get_technical_requirements and get_business_requirements in parallel with a ThreadPoolExecutor
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # submit the tasks to the pool
-       future_technical = executor.submit(get_technical_requirements, invalid_path, processed_filepath, tech_model_type)
+       future_technical = executor.submit(get_technical_requirements, invalid_path, processed_filepath, tech_model_type)  # procurement requirements
        time.sleep(0.5)  # keep the original delay
-       future_business = executor.submit(get_business_requirements, procurement_path, processed_filepath, busi_model_type)
+       future_business = executor.submit(get_business_requirements, procurement_path, processed_filepath, busi_model_type)  # technical/business/service/other requirements
        # fetch the results of the parallel tasks
        technical_requirements = future_technical.result()
        business_requirements = future_business.result()
Binary files added: md_files/0.png through md_files/14.png (15.png absent), plus md_files/16.png and md_files/17.png — the screenshots referenced in the README.