2.7 Adding comments
README.md (new file):
Project repository: [zy123/zbparse - zbparse - 智标领航代码仓库](http://47.98.59.178:3000/zy123/zbparse)

git clone URL: http://47.98.59.178:3000/zy123/zbparse.git

Work on the develop branch; among the develop-xx branches, the more recent the xx, the newer the code.

Production environment: 121.41.119.164:5000

Test environment: 47.98.58.178:5000

Full parse (大解析): entered through the tender-file parsing entry point; handled by upload.py.

Small parse (小解析): entered through the bid-file generation entry point; the backend calls the two interfaces little_zbparse and get_deviation together.
## Project structure

.env holds the secrets (LLM keys, textin keys, etc.). It is covered by .gitignore, so a `git pull` on the server will not update it (the keys are sensitive); the .env at the corresponding location on the server has to be maintained by hand.
**How to update the version on the server:**

1. Enter the project folder.

   **Note:** confirm that .env exists on the server; it is hidden by default. Run `cat .env` to check; if it does not exist, run `sudo vim .env` in the project folder and paste the secrets in!

2. `git pull`

3. `sudo docker-compose up --build -d` to rebuild and restart.

   Alternatively, `sudo docker-compose build` builds the image first, and `sudo docker-compose up -d` restarts later when the service is idle.

4. `sudo docker-compose logs flask_app --since 1h` shows the last hour of logs (errors after a restart show up here too; running it after every restart is recommended).

requirements.txt usually needs no changes; only when the code starts using a new library must its package name and version be added there by hand.
**How to run the project locally:**

1. Set up the environment from requirements.txt.
2. Configure .env (no extra OS-level environment variables are normally needed).
3. Open the run dropdown, choose Edit Configurations, and set run_serve.py as the startup script.
4. Send a POST request from Postman to test:

   http://127.0.0.1:5000/upload

   body:

   ```json
   {
       "file_url": "xxxx",
       "zb_type": 2
   }
   ```
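For scripted testing instead of Postman, the streamed reply can be consumed line by line. This is a hedged sketch: it assumes the service emits standard SSE `data:` lines, and the request itself is left commented out so nothing here needs a running server.

```python
import json

def parse_sse_line(line: str):
    """Parse one 'data: ...' line of an SSE stream into a dict (None for other lines)."""
    prefix = "data: "
    if not line.startswith(prefix):
        return None
    return json.loads(line[len(prefix):])

# Hypothetical usage against a locally running instance (not executed here):
# import requests
# with requests.post("http://127.0.0.1:5000/upload",
#                    json={"file_url": "xxxx", "zb_type": 2}, stream=True) as resp:
#     for raw in resp.iter_lines(decode_unicode=True):
#         event = parse_sse_line(raw)
#         if event is not None:
#             print(event)
```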
## flask_app structure

### general

The folder for shared functions. Under llm are the various large-model wrappers; under 读取文件 are the docx/pdf readers and the document cleanup clean_pdf, which strips headers, footers and page numbers.

general/llm/清除file_id.py needs to run **at least once a week** to keep the number of file_ids from exceeding the quota (on my side every request records and cleans up its file_ids when it finishes; Xiang probably has not added that yet).

llm/model_continue_query is the "model continues answering" script: for texts too long for the model to finish in one response, it keeps asking and stitches the pieces into the complete answer.

general/file2markdown converts files to markdown via textin.

general/format_change converts pdf -> docx, or doc/docx -> pdf.

general/merge_pdfs.py concatenates files: 1. tender announcement + instructions to bidders; 2. evaluation-rules chapter + qualification-review chapter.
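The continue-and-stitch idea behind model_continue_query can be sketched as follows; `ask` and its finished-flag are hypothetical stand-ins for the real model call, not the actual API:

```python
def query_until_complete(ask, question, max_rounds=5):
    """Keep asking the model to continue until it reports it has finished,
    concatenating the partial answers into one full text."""
    parts = []
    prompt = question
    for _ in range(max_rounds):
        text, finished = ask(prompt)  # hypothetical model call: returns (chunk, done-flag)
        parts.append(text)
        if finished:
            break
        prompt = "请接着上文继续输出"  # ask the model to continue where it stopped
    return "".join(parts)
```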
**The important parts of general!!!**

**Post-processing:**

general/**post_processing** is the post-processing stage after parsing; extract_info, the qualification review, the technical deviations, the business deviations and the required proof materials are all generated here.

**inner_post_processing** inside post_processing extracts *extracted_info* specifically.

**process_functions_in_parallel** inside post_processing extracts the qualification review, technical deviations, business deviations and required proof materials.

The full parse (upload) uses the complete post_processing; little_zbparse.py and 小解析main.py use inner_post_processing; get_deviation.py and 偏离表数据解析main.py use process_functions_in_parallel.
**PDF truncation:**

*截取pdf_main.py* is the top-level function; the second level is *截取pdf货物标版.py* and *截取pdf工程标版.py* (not under general); the third level is *截取pdf通用函数.py*.
**Shared code for invalid and void bids**

The main code for extracting the invalid-bid and void-bid clauses. The docx file is preprocessed => regex matching => temp.txt => LLM filtering. If the extraction is incomplete, either the regexes do not cover the pattern or the LLM prompt missed it.

Note: if a paragraph is matched both by the regexes and by any of the follow_up_keywords, it is not added to temp (i.e. not filtered by the LLM) but is **added directly** to the final result!
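The routing rule above can be sketched like this; the patterns and keywords are illustrative, not the real lists:

```python
import re

def route_paragraphs(paragraphs, patterns, follow_up_keywords):
    """Split paragraphs into those sent to the LLM (via temp) and those added
    directly to the final result (regex hit AND follow-up keyword hit)."""
    to_llm, direct = [], []
    for para in paragraphs:
        if any(re.search(p, para) for p in patterns):
            if any(re.search(k, para) for k in follow_up_keywords):
                direct.append(para)  # skips the LLM, goes straight into the result
            else:
                to_llm.append(para)  # written to temp.txt, filtered by the LLM
    return to_llm, direct

to_llm, direct = route_paragraphs(
    ["投标文件无效:", "无效投标的情形如下:1.未密封"],
    patterns=[r"无效"], follow_up_keywords=[r"情形如下"])
```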
**Extracting the bidder-notice body clauses into a JSON file**

The truncated ztbfile_tobidders_notice_part2.pdf, i.e. the notice body, is converted into clause1.json, which the later extraction of the **bid opening/evaluation/award process**, the **bid-file requirements**, and **re-tendering, no further tendering and termination of tendering** builds on.

The core logic first matches chapter headings like '一、总则', then clause numbers like '1.1' and '1.1.1'. Because the pdf is read line by line, the content after one number can span several lines, so everything up to a line starting with the next number (e.g. '2.1') is attributed to the previous number.
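A minimal sketch of the clause-accumulation logic (heavily simplified; the real parser also handles chapter headings like '一、总则' and many more numbering forms):

```python
import re

def parse_clauses(lines):
    """Group lines into {number: content}: a line starting with 'x.y' opens a new
    clause; every following line belongs to it until the next number appears."""
    clauses, current = {}, None
    for line in lines:
        m = re.match(r'^(\d+(?:\.\d+)+)\s*(.*)', line)
        if m:
            current = m.group(1)
            clauses[current] = m.group(2)
        elif current:
            clauses[current] += line  # continuation of the previous clause
    return clauses

result = parse_clauses(["1.1 总则内容", "跨行的剩余部分", "2.1 新条款"])
```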
### old_version

Deprecated code only; none of it is used in the production or test environments, so it can be ignored.
### routes

The interfaces and their main implementations, in one-to-one correspondence:

- get_deviation corresponds to 偏离表数据解析main and produces the deviation-table data.
- judge_zbfile corresponds to the check whether a file is a tender document.
- little_zbparse corresponds to 小解析main and parses extract_info.
- test_zbparse is a test interface with no counterpart.
- upload corresponds to the engineering-bid and goods-bid parsing, i.e. the full parse.

**Disambiguation**: "small parse" can refer to a process, namely the parse entered through the '投标文件生成' entry point, in which the backend calls little_zbparse and get_deviation together; that process is called the small parse. But little_zbparse itself is also called the small parse: it got the name because originally only this data (extract_info) had to be returned, and the business/technical deviations were only added later.

utils holds the shared helpers for the routes. Its validate_and_setup_logger function maps each interface request to its own output folder, e.g. upload -> output1; mappings for new interfaces can be added there directly.

For the full parse, focus on **upload.py** and **货物标解析main.py**.
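The request-to-folder mapping in validate_and_setup_logger presumably boils down to a small lookup table; only upload -> output1 is stated in the notes, the other entries here are assumptions:

```python
# Hypothetical sketch of the route -> output subfolder mapping.
ROUTE_OUTPUT_MAP = {
    '/upload': 'output1',          # full parse (stated in the notes)
    '/little_zbparse': 'output2',  # assumed
    '/get_deviation': 'output3',   # assumed
}

def output_folder_for(route: str) -> str:
    """Return the output subfolder for an interface route (assumed default name)."""
    return ROUTE_OUTPUT_MAP.get(route, 'output_default')
```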
### static

Holds the parsing outputs and the prompts. output is gitignored, so `git push` does not push it. The individual folders (output1, output2, ...) correspond to the different interface requests.
### test_case & testdir

test_case contains test cases for some of the functions; it has not been updated in a long time. testdir is the scratch area for day-to-day testing while coding. Neither affects parsing in the production or test environments.
### 工程标 & 货物标

These hold the parts where the two parsing flows differ (everything they share lives in **general**). The main difference is that the goods-bid flow additionally parses the procurement requirements (提取采购需求main + 技术参数要求提取 + 商务服务其他要求提取).
### Finally:

- ConnectionLimiter.py defines the interface timeouts; on timeout the connection to the backend is closed.
- logger_setup.py creates a separate log for each request; each log corresponds to one log.txt.
- start_up.py is a startup script; run_serve is also a startup script, a thin wrapper around start_up.py. The dockerfile currently starts the service via run_serve directly.
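ConnectionLimiter's timeout behaviour can be approximated with a decorator along these lines; this is a sketch, not the real implementation (which also has to close the SSE connection, and should cancel the worker rather than letting it run on):

```python
import concurrent.futures
import functools

def require_connection_limit(timeout):
    """Run the wrapped view in a worker thread and give up after `timeout` seconds."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
                future = pool.submit(func, *args, **kwargs)
                # raises concurrent.futures.TimeoutError on expiry; note the worker
                # keeps running, so a real implementation must also clean up
                return future.result(timeout=timeout)
        return wrapper
    return decorator

@require_connection_limit(timeout=2)
def quick():
    return "ok"
```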
## Ongoing follow-ups

```
yield sse_format(tech_deviation_response)
yield sse_format(tech_deviation_star_response)
yield sse_format(zigefuhe_deviation_response)
yield sse_format(shangwu_deviation_response)
yield sse_format(shangwu_star_deviation_response)
yield sse_format(proof_materials_response)
```

1. The engineering-bid parse still does not parse the procurement requirements, so its post-processing returns only '资格审查', '证明材料' and 'extracted_info'; there is no '商务偏离' or '商务带星偏离', and no '技术偏离' or '技术带星偏离'. The goods-bid parse is the complete version. '证明材料' and 'extracted_info' are returned directly to the backend for storage.

2. The full parse returns the technical score; the backend not only shows it to the frontend but also passes it on to Xiang to generate the technical deviation table.

3. In the small parse, get_deviation.py could return the technical score as well, but it does not: there was nobody to integrate with, so it is commented out for now.

4. The business-review and technical-review deviation tables, i.e. the deviation tables for the scoring rules, have not been built yet, but the **business score and technical score** are parsed in both the full and the small parse; that data only needs light processing before being returned to the backend.

   The parsed result as it stands suits frontend display; generating the business/technical review deviation tables would take one more LLM call to re-summarize the data, ideally into a list of strings, before handing it to the backend. (Not done.)
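The sse_format helper used in the snippet above presumably wraps each payload into one SSE `data:` frame; its exact shape is an assumption:

```python
import json

def sse_format(payload) -> str:
    """Serialize a response dict as one Server-Sent-Events frame (assumed shape)."""
    if not isinstance(payload, str):
        payload = json.dumps(payload, ensure_ascii=False)
    return f"data: {payload}\n\n"

frame = sse_format({"status": "success", "data": "END"})
```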
The hunks below show the new side of each diff, i.e. the code after the comments were added (duplicated old/new lines from the diff viewer removed):

@@ -266,7 +266,8 @@ def outer_post_processing(combined_data, includes, good_list):

```python
if "基础信息" in includes:
    base_info = combined_data.get("基础信息", {})
    # call the inner inner_post_processing to handle '基础信息'
    extracted_info = inner_post_processing(base_info)  # generates extract_info, returned to the backend

    # keep '基础信息' in the processed data
    processed_data["基础信息"] = base_info
    # extract '采购需求' under '采购要求'
```
@@ -291,7 +292,7 @@ def outer_post_processing(combined_data, includes, good_list):

```python
busi_eval = combined_data.get("商务评分", {})
busi_eval_info = json.dumps(busi_eval, ensure_ascii=False, indent=4)
all_data_info = '\n'.join([zige_info, fuhe_info, zigefuhe_info, tech_deviation_info, busi_requirements_info, tech_eval_info, busi_eval_info])
tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation, proof_materials = process_functions_in_parallel(  # main function generating the technical and business deviations
    tech_requirements_dict=tech_requirements,
    busi_requirements_dict=busi_requirements,
    zige_info=zige_info,
```
@@ -387,8 +387,8 @@ def extract_from_notice(merged_baseinfo_path, clause_path, type):

```python
# map type to target_values
type_target_map = {
    1: ["投标", "投标文件", "响应文件"],  # bid-file requirements
    2: ["开标", "评标", "定标", "评审", "成交", "合同", "磋商", "谈判", "中标", "程序", "步骤"],  # opening/evaluation/award process
    3: ["重新招标、不再招标和终止招标", "重新招标", "重新采购", "不再招标", "不再采购", "终止招标", "终止采购"],
    4: ["评标"]  # test
}
```
@@ -10,11 +10,10 @@ from flask_app.general.截取pdf通用函数 import clean_page_content, extract_

```python
def compare_headings(current, new):
    """
    Compare the hierarchy of two headings and make sure the new heading is larger
    than the current one, with the leading numbers differing by no more than 5.

    The new number is assumed to be larger than the old one; the bound also guards
    against very large numbers at the start of a line, e.g. '2025年xxx', where '2025'
    would otherwise be wrongly matched as a heading number although it is just body text.

    Args:
        current (str): current heading, e.g. "1.2.3"
        new (str): new heading, e.g. "1.3"

    Returns:
        bool: True if the new heading is larger than the current one and the leading
        numbers differ by no more than 3, otherwise False
    """
```
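A sketch of the comparison with the magnitude guard; the parsing and the threshold handling are simplified relative to the real function:

```python
def compare_headings(current: str, new: str, max_leading_gap: int = 5) -> bool:
    """True if `new` sorts after `current` and the leading numbers differ by at
    most `max_leading_gap`, which rejects body text like '2025年…' as a heading."""
    cur = [int(p) for p in current.split('.') if p.isdigit()]
    nxt = [int(p) for p in new.split('.') if p.isdigit()]
    if not cur or not nxt:
        return False
    if nxt[0] - cur[0] > max_leading_gap:
        return False  # e.g. current '3', new '2025' -> body text, not a heading
    return nxt > cur  # lexicographic list comparison, e.g. [1, 3] > [1, 2, 3]

ok = compare_headings("1.2.3", "1.3")
```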
@@ -80,7 +79,7 @@ def parse_text_by_heading(text):

```python
lines = text.split('\n')

def get_current_number(key_chinese):
    chinese_to_number = {  # makes Chinese ordinals comparable by size
        '一': 1, '二': 2, '三': 3, '四': 4, '五': 5,
        '六': 6, '七': 7, '八': 8, '九': 9, '十': 10,
        '十一': 11, '十二': 12, '十三': 13, '十四': 14, '十五': 15
```

@@ -99,7 +98,7 @@ def parse_text_by_heading(text):

```python
patterns = [
    r'^(?<![a-zA-Z((])(\d+(?:\.\d+)+)\s*(.*)',  # matches '12.1 content'
    r'^(\d+\.)\s*(.+)$',  # matches '12. content'
    r'^[..](\d+(?:[..]\d+)*)\s*(.+)$',  # matches '.12.1 content' — a line can start with a dot because clean_page_content, which strips headers/footers/page numbers while reading the pdf, may mistake a heading number for a page number and delete it; this case is handled fairly well in the code
    r'^(\d+)([^.\d].*)'  # matches '27 content'
]
for pattern in patterns:
```
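The first pattern above can be exercised directly (same regex as in the hunk):

```python
import re

pattern = re.compile(r'^(?<![a-zA-Z((])(\d+(?:\.\d+)+)\s*(.*)')

m = pattern.match('12.1 投标文件的组成')
# lines that start with a letter, like 'A.12.1 附件', must NOT match
no_match = pattern.match('A.12.1 附件')
```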
@@ -111,9 +110,9 @@ def parse_text_by_heading(text):

```python
first_five_lines = lines[:5]
has_initial_heading_patterns = False
for line in first_five_lines:
    line_stripped = line.strip().replace('.', '.')  # line_stripped is the line currently being processed
    if line_stripped.startswith("##"):  # marker added during preprocessing: '##' is prepended to the first line of every pdf page for special handling, because that line's heading number is often wrongly deleted by clean_page_content and has to be restored
        line_stripped = line_stripped[2:]  # remove "##"
    if (pattern_numbered.match(line_stripped) or pattern_parentheses.match(
            line_stripped) or pattern_letter_initial.match(line_stripped)):
        has_initial_heading_patterns = True
```

@@ -135,14 +134,14 @@ def parse_text_by_heading(text):

```python
if not match:
    match = re.match(r'^(\d+\.)\s*(.+)$', line_stripped)

# check whether we enter or leave a special section: after e.g. '5、 竞争性磋商采购文件的构成' the following lines may contain '一、招标公告' '二、投标人须知' '七、 xxxx'; without this guard the program would wrongly match them as chapter headings, although they should be treated as body text belonging to '5、 竞争性磋商采购文件的构成'!
# inside a special section every processed line counts as content of the current number, not as a new number
if is_heading(line_stripped):
    if any(re.search(pattern, line_stripped) for pattern in special_section_keywords):
        in_special_section = True
    elif in_special_section:
        in_special_section = False

# below is the original matching logic
# match lines starting with a dot followed by numbers, e.g. '.12.1 content'
dot_match = re.match(r'^[..](\d+(?:[..]\d+)*)\s*(.+)$', line_stripped)
```

@@ -283,8 +282,9 @@ def parse_text_by_heading(text):

```python
                                               in_special_section)

else:
    # nothing above matched, so the current line is either a big heading like '一、总则' or plain body text like '应通知采购代理机构补全或更换,否则风险自负。'
    if has_initial_heading_patterns and not skip_subheadings and not in_special_section:
        # matched one of the big-heading forms '一、xx' '(一)、xx' 'A.xx' and not inside a 'special section'; continue with the code below
        numbered_match = pattern_numbered.match(line_stripped)  # 一、
        parentheses_match = pattern_parentheses.match(line_stripped)  # (一)
        if i < 5:
```

@@ -368,7 +368,7 @@ def parse_text_by_heading(text):

```python
    append_newline = handle_content_append(current_content, line_stripped, append_newline, keywords,
                                           in_special_section)
else:
    # in this branch everything counts as content of the current heading and is appended to current_content
    if line_stripped:
        append_newline = handle_content_append(current_content, line_stripped, append_newline, keywords,
                                               in_special_section)
```
|
@ -74,8 +74,11 @@ def clean_dict_datas(extracted_contents, keywords, excludes): # 让正则表达
|
|||||||
|
|
||||||
return all_text1, all_text2 # all_texts1要额外用gpt all_text2直接返回结果
|
return all_text1, all_text2 # all_texts1要额外用gpt all_text2直接返回结果
|
||||||
|
|
||||||
#处理跨页的段落
|
|
||||||
def preprocess_paragraphs(elements):
|
def preprocess_paragraphs(elements):
|
||||||
|
'''
|
||||||
|
处理跨页的段落,程序逻辑判断两个段落能否合并在一起。
|
||||||
|
'''
|
||||||
processed = [] # 初始化处理后的段落列表
|
processed = [] # 初始化处理后的段落列表
|
||||||
index = 0
|
index = 0
|
||||||
flag = False # 初始化标志位
|
flag = False # 初始化标志位
|
||||||
@@ -437,8 +440,10 @@ def split_cell_text(text):

```python
    # print(split_sentences)
    return split_sentences

def extract_file_elements(file_path):
    '''
    File preprocessing: extract text and tables in document order and merge tables that span pages.
    '''
    doc = Document(file_path)
    doc_elements = doc.element.body
    doc_paragraphs = doc.paragraphs
```
|
@ -71,4 +71,4 @@ def create_logger(app, subfolder):
|
|||||||
logger.setLevel(logging.INFO)
|
logger.setLevel(logging.INFO)
|
||||||
logger.propagate = False
|
logger.propagate = False
|
||||||
g.logger = logger
|
g.logger = logger
|
||||||
g.output_folder = output_folder
|
g.output_folder = output_folder #输出文件夹路径
|
||||||
|
@@ -11,7 +11,7 @@ from flask_app.工程标.无效标和废标和禁止投标整合 import combine_

```python
from flask_app.工程标.投标人须知正文提取指定内容工程标 import extract_from_notice
import concurrent.futures
from flask_app.工程标.基础信息整合工程标 import combine_basic_info
from flask_app.工程标.资格审查模块main import combine_review_standards
from flask_app.old_version.商务评分技术评分整合old_version import combine_evaluation_standards
from flask_app.general.format_change import pdf2docx, docx2pdf, doc2docx
from flask_app.general.docx截取docx import copy_docx
```
@@ -13,7 +13,7 @@ get_deviation_bp = Blueprint('get_deviation', __name__)

```python
@get_deviation_bp.route('/get_deviation', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=720)
def get_deviation():  # provides the business/technical deviation data
    logger = g.logger
    unique_id = g.unique_id
    file_url = g.file_url
```
@@ -16,7 +16,7 @@ class JudgeResult(Enum):

```python
@judge_zbfile_bp.route('/judge_zbfile', methods=['POST'])
@validate_and_setup_logger
# @require_connection_limit(timeout=30)
def judge_zbfile() -> Any:  # decides whether the file is a tender document
    """
    Main function; calls wrapper and sets the timeout for the whole interface. Returns a default value on timeout.
    """
```
@@ -13,7 +13,7 @@ little_zbparse_bp = Blueprint('little_zbparse', __name__)

```python
@little_zbparse_bp.route('/little_zbparse', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=300)
def little_zbparse():  # small parse
    logger = g.logger
    file_url = g.file_url
    zb_type = g.zb_type
```
@@ -15,7 +15,7 @@ upload_bp = Blueprint('upload', __name__)

```python
@upload_bp.route('/upload', methods=['POST'])
@validate_and_setup_logger
@require_connection_limit(timeout=720)
def zbparse():  # full parse
    logger = g.logger
    try:
        logger.info("大解析开始!!!")
```

@@ -25,7 +25,7 @@ def zbparse():

```python
zb_type = g.zb_type
try:
    logger.info("starting parsing url:" + file_url)
    return process_and_stream(file_url, zb_type)  # the main executing function
except Exception as e:
    logger.error('Exception occurred: ' + str(e))
    if hasattr(g, 'unique_id'):
```
@@ -89,12 +89,12 @@ def process_and_stream(file_url, zb_type):

```python
good_list = None

processing_functions = {
    1: engineering_bid_main,  # engineering-bid parse
    2: goods_bid_main  # goods-bid / service-bid parse
}
processing_func = processing_functions.get(zb_type, goods_bid_main)

for data in processing_func(output_folder, downloaded_filepath, file_type, unique_id):  # receives the goods-bid / engineering-bid output piece by piece, for the frontend display
    if not data.strip():
        logger.error("Received empty data, skipping JSON parsing.")
        continue
```

@@ -117,7 +117,7 @@ def process_and_stream(file_url, zb_type):

```python
    yield sse_format(error_response)
    return  # stop further processing

if 'good_list' in parsed_data:  # goods list
    good_list = parsed_data['good_list']
    logger.info("Collected good_list from the processing function: %s", good_list)
    continue
```

@@ -131,20 +131,22 @@ def process_and_stream(file_url, zb_type):

```python
    status='success',
    data=data
)
yield sse_format(response)  # returned to the backend -> shown on the frontend

base_end_time = time.time()
logger.info(f"分段解析完成,耗时:{base_end_time - start_time:.2f} 秒")

# at this point the frontend has received all parsed content; nothing below affects the frontend display — it is post-processing: 1. extracted_result, key-information storage 2. technical deviation table 3. business deviation table 4. proof materials the bidder must submit (the backend stores them; the frontend does not show them yet)
# post-processing starts!!!
output_json_path = os.path.join(output_folder, 'final_result.json')
extracted_info_path = os.path.join(output_folder, 'extracted_result.json')
includes = ["基础信息", "资格审查", "商务评分", "技术评分", "无效标与废标项", "投标文件要求", "开评定标流程"]
final_result, extracted_info, tech_deviation, tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation, proof_materials = outer_post_processing(
    combined_data, includes, good_list)  # post-processing: generates extracted_info, the business/technical deviation data and the proof materials returned to the backend

# post-processing done! everything below only builds responses; the data is not modified further
tech_deviation_response, tech_deviation_star_response, zigefuhe_deviation_response, shangwu_deviation_response, shangwu_star_deviation_response, proof_materials_response = generate_deviation_response(
    tech_deviation, tech_star_deviation, business_deviation, business_star_deviation, zigefuhe_deviation,
    proof_materials, logger)  # builds the standardized responses

# use the generic response helper
yield sse_format(tech_deviation_response)
```

@@ -184,7 +186,7 @@ def process_and_stream(file_url, zb_type):

```python
)
yield sse_format(complete_response)

final_response = create_response(  # the backend currently terminates the connection once it reads 'END' in 'data'
    message='文件上传并处理成功',
    status='success',
    data='END'
```
@@ -16,7 +16,7 @@ from flask_app.general.无效标和废标公共代码 import combine_find_invali

```python
from flask_app.general.投标人须知正文提取指定内容 import extract_from_notice
import concurrent.futures
from flask_app.工程标.基础信息整合工程标 import combine_basic_info
from flask_app.工程标.资格审查模块main import combine_review_standards
from flask_app.general.商务技术评分提取 import combine_evaluation_standards
from flask_app.general.format_change import pdf2docx, docx2pdf
```

@@ -34,9 +34,8 @@ def preprocess_files(output_folder, file_path, file_type, logger):

```python
# call the PDF truncation several times
truncate_files = truncate_pdf_multiple(pdf_path, output_folder, logger, 'goods')  # index: 0->商务技术服务要求 1->评标办法 2->资格审查 3->投标人须知前附表 4->投标人须知正文

# process the individual parts
invalid_path = truncate_files[6] if truncate_files[6] != "" else pdf_path  # invalid-bid file (the content before 投标文件格式 / 合同条款)

invalid_added_pdf = insert_mark(invalid_path)
invalid_added_docx = pdf2docx(invalid_added_pdf)  # invalid_path with markers
```
@@ -141,6 +140,7 @@ def fetch_invalid_requirements(invalid_added_docx, output_folder, logger):

```python
        result = {"无效标与废标": {}}
    return result

# bid-file requirements
def fetch_bidding_documents_requirements(invalid_deleted_docx, merged_baseinfo_path, clause_path, logger):
    logger.info("starting 投标文件要求...")
    if not merged_baseinfo_path:
```
@@ -216,22 +216,28 @@ def goods_bid_main(output_folder, file_path, file_type, unique_id):

```python
with concurrent.futures.ThreadPoolExecutor() as executor:
    # start the tasks that do not depend on knowledge_name and index right away
    futures = {
        'evaluation_standards': executor.submit(fetch_evaluation_standards, processed_data['invalid_deleted_docx'],  # technical score, business score
                                                processed_data['evaluation_method_path'], logger),

        'invalid_requirements': executor.submit(fetch_invalid_requirements, processed_data['invalid_added_docx'],  # invalid and void bid clauses
                                                output_folder, logger),

        'bidding_documents_requirements': executor.submit(fetch_bidding_documents_requirements, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],
                                                          processed_data['clause_path'], logger),  # bid-file requirements

        'opening_bid': executor.submit(fetch_bid_opening, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],
                                       processed_data['clause_path'], logger),  # opening/evaluation/award process

        'base_info': executor.submit(fetch_project_basic_info, processed_data['invalid_deleted_docx'], processed_data['merged_baseinfo_path'],  # basic information
                                     processed_data['procurement_path'], processed_data['clause_path'], logger),

        'qualification_review': executor.submit(fetch_qualification_review, processed_data['invalid_deleted_docx'],  # qualification review
                                                processed_data['qualification_path'],
                                                processed_data['notice_path'], logger),
    }

    # handle these independent tasks early, in completion order
    for future in concurrent.futures.as_completed(futures.values()):  # as_completed: whichever finishes first is returned first
        key = next(k for k, v in futures.items() if v == future)
        try:
            result = future.result()
```

@@ -244,8 +250,8 @@ def goods_bid_main(output_folder, file_path, file_type, unique_id):

```python
technical_standards = result["technical_standards"]
commercial_standards = result["commercial_standards"]
# return the technical and the business evaluation separately
yield json.dumps({'technical_standards': transform_json_values(technical_standards)}, ensure_ascii=False)  # technical score
yield json.dumps({'commercial_standards': transform_json_values(commercial_standards)}, ensure_ascii=False)  # business score
else:
    # handle the results of the other tasks
    yield json.dumps({key: transform_json_values(result)}, ensure_ascii=False)
```
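as_completed yields futures in completion order, not submission order, which is why goods_bid_main can stream whichever section finishes first; in isolation:

```python
import concurrent.futures
import time

def task(name, delay):
    time.sleep(delay)
    return name

with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = {
        'slow': executor.submit(task, 'slow', 0.2),
        'fast': executor.submit(task, 'fast', 0.01),
    }
    order = []
    for future in concurrent.futures.as_completed(futures.values()):
        # recover the key of the finished future, as goods_bid_main does
        key = next(k for k, v in futures.items() if v == future)
        order.append(key)
```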
@@ -31,6 +31,7 @@ def create_app():

```python
@app.teardown_request
def teardown_request(exception):
    # runs after every interface request; does some cleanup work
    output_folder = getattr(g, 'output_folder', None)
    if output_folder:
        # perform the cleanup related to output_folder (e.g. deleting temporary files)
```
```diff
@@ -29,7 +29,7 @@
 }
 }
 
-6.请从提供的招标文件中提取与“信息公示媒介”相关的信息(如补充说明、文件澄清、评标结果等)公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
+6.请从提供的招标文件中提取与“信息公示媒介”相关的信息,即补充说明、文件澄清、评标结果等信息的公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
 {
 "信息公示媒介":["招标公告在政府采购网(www.test.gov.cn)发布。","中标结果将在采购网(www.test.bid.cn)予以公告。"]
 }
```
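Since the prompt pins the model's reply to a fixed JSON shape, the caller can validate it cheaply before use. A sketch with a hypothetical `parse_media_answer` helper (not a function from the repo):

```python
import json

def parse_media_answer(raw: str):
    # hypothetical validator: the prompt demands a dict whose "信息公示媒介"
    # key holds a list of verbatim strings copied from the tender document
    data = json.loads(raw)
    media = data.get("信息公示媒介")
    if not isinstance(media, list) or not all(isinstance(s, str) for s in media):
        raise ValueError("reply does not match the prompted JSON shape")
    return media

raw = '{"信息公示媒介": ["招标公告在政府采购网(www.test.gov.cn)发布。"]}'
media = parse_media_answer(raw)
```

Rejecting malformed replies early makes it easy to route the question back through a retry instead of propagating a half-parsed structure downstream.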
```diff
@@ -29,7 +29,7 @@
 }
 }
 
-6.请从提供的招标文件中提取与“信息公示媒介”相关的信息(如补充说明、文件澄清、评标结果等)公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
+6.请从提供的招标文件中提取与“信息公示媒介”相关的信息,即补充说明、文件澄清、评标结果等信息的公示媒介(如网址、官网)。按以下 JSON 格式输出信息,其中键名为“信息公示媒介”,键值为一个字符串列表。如果存在多个信息公示媒介,请分别将原文中相关表述逐字添加至字符串中。注意,若只有一个信息公示媒介,字符串列表中只包含一个字符串。示例输出格式如下,仅供参考:
 {
 "信息公示媒介":["招标公告在政府采购网(www.test.gov.cn)发布。","中标结果将在采购网(www.test.bid.cn)予以公告。"]
 }
```
```diff
@@ -188,8 +188,7 @@ def truncate_pdf_main_engineering(input_path, output_folder, selection, logger,
     # 投标人须知
     path1, path2 = extract_pages_tobidders_notice(pdf_path, output_folder, begin_page, common_header, invalid_endpage)
     return [path1 or "", path2 or ""]
-elif selection == 5:
-    #无效标(投标文件格式或合同条款前的内容)
+elif selection == 5: #除去投标文件格式之前的内容 或者 合同条款之前的内容
     invalid_path, end_page = get_invalid_file(pdf_path, output_folder, common_header, begin_page)
     return [invalid_path or "", end_page]
 else:
```
```diff
@@ -40,11 +40,11 @@ def combine_basic_info(merged_baseinfo_path, procurement_path,clause_path,invali
     temp_list = []
     procurement_reqs = {}
     # 定义一个线程函数来获取基础信息
-    def get_base_info_thread():
+    def get_base_info_thread(): #传统的基础信息提取
         nonlocal temp_list
         temp_list = get_base_info(merged_baseinfo_path,clause_path,invalid_path)
     # 定义一个线程函数来获取采购需求
-    def fetch_procurement_reqs_thread():
+    def fetch_procurement_reqs_thread(): #采购要求提取
         nonlocal procurement_reqs
         procurement_reqs = fetch_procurement_reqs(procurement_path,invalid_path)
     # 创建并启动获取基础信息的线程
```
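`combine_basic_info` fans the two extractions out to worker threads that report back through `nonlocal`. The shape of that pattern in isolation, with stand-in payloads instead of the real `get_base_info` / `fetch_procurement_reqs` calls:

```python
import threading

def combine_demo():
    temp_list = []
    procurement_reqs = {}

    def get_base_info_thread():
        nonlocal temp_list
        temp_list = ["基础信息"]  # stand-in for get_base_info(...)

    def fetch_procurement_reqs_thread():
        nonlocal procurement_reqs
        procurement_reqs = {"采购需求": {}}  # stand-in for fetch_procurement_reqs(...)

    t1 = threading.Thread(target=get_base_info_thread)
    t2 = threading.Thread(target=fetch_procurement_reqs_thread)
    t1.start()
    t2.start()
    t1.join()  # wait for both workers before reading the shared results
    t2.join()
    return temp_list, procurement_reqs

base, reqs = combine_demo()
```

Because each thread writes to a different variable and the caller only reads after both joins, no lock is needed here.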
```diff
@@ -216,10 +216,10 @@ def truncate_pdf_main_goods(input_path, output_folder, selection,logger, output_
     6: 0 # Added default for selection 6 if needed
 }.get(selection, 0)
 # 根据选择设置对应的模式和结束模式
-if selection == 1:
+if selection == 1: #招标公告
     path=get_notice(pdf_path, output_folder, begin_page,common_header, invalid_endpage)
     return [path or ""]
-elif selection == 2:
+elif selection == 2: #评标方法
     begin_pattern = regex.compile(
         r'^第[一二三四五六七八九十]+(?:章|部分)\s*'
         r'(?<!"\s*)(?<!“\s*)(?<!”\s*)(?=.*(?:磋商(?=.*(?:办法|方法|内容))|'
```
```diff
@@ -230,7 +230,7 @@ def truncate_pdf_main_goods(input_path, output_folder, selection,logger, output_
         r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+',regex.MULTILINE
     )
     local_output_suffix = "evaluation_method"
-elif selection == 3:
+elif selection == 3: #资格审查
     begin_pattern = regex.compile(
         r'^第[一二三四五六七八九十百千]+(?:章|部分).*?(资格审查).*', regex.MULTILINE
     )
```
```diff
@@ -238,10 +238,10 @@ def truncate_pdf_main_goods(input_path, output_folder, selection,logger, output_
         r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+', regex.MULTILINE
     )
     local_output_suffix = "qualification1"
-elif selection == 4:
+elif selection == 4: #投标人须知:前附表+须知正文
     path1, path2 = extract_pages_tobidders_notice(pdf_path, output_folder, begin_page, common_header, invalid_endpage)
     return [path1 or "", path2 or ""]
-elif selection == 5:
+elif selection == 5: #采购需求
     begin_pattern = regex.compile(
         r'^第[一二三四五六七八九十百千]+(?:章|部分).*?(?:服务|项目|商务|技术|供货).*?要求|'
         r'^第[一二三四五六七八九十百千]+(?:章|部分)(?!.*说明).*(?:采购.*?(?:内容|要求|需求)|(招标|项目)(?:内容|要求|需求)).*|'
```
```diff
@@ -251,7 +251,7 @@ def truncate_pdf_main_goods(input_path, output_folder, selection,logger, output_
         r'^第[一二三四五六七八九十百千]+(?:章|部分)\s*[\u4e00-\u9fff]+',regex.MULTILINE
     )
     local_output_suffix = "procurement"
-elif selection == 6:
+elif selection == 6: #除去投标文件格式之前的内容 或者 合同条款之前的内容
    invalid_path, end_page = get_invalid_file(pdf_path, output_folder, common_header, begin_page)
    return [invalid_path or "", end_page]
 else:
```
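Each `selection` branch above compiles a `begin_pattern` anchored on a chapter heading. A reduced sketch of that dispatch using stdlib `re` (the project uses the third-party `regex` module and much richer lookarounds; these two patterns are simplified illustrations, not the real ones):

```python
import re

# simplified chapter-heading patterns keyed by selection, illustrative only
PATTERNS = {
    3: re.compile(r'^第[一二三四五六七八九十百千]+(?:章|部分).*?资格审查', re.MULTILINE),
    5: re.compile(r'^第[一二三四五六七八九十百千]+(?:章|部分).*?采购需求', re.MULTILINE),
}

def find_section_heading(text: str, selection: int):
    # return the first heading that opens the requested section, else None
    m = PATTERNS[selection].search(text)
    return m.group(0) if m else None

sample = "第一章 招标公告\n第三章 资格审查办法\n第五章 采购需求\n"
```

`re.MULTILINE` makes `^` match at every line start, which is what lets the pattern lock onto chapter headings anywhere in the extracted PDF text.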
```diff
@@ -656,7 +656,7 @@ def get_technical_requirements(invalid_path, processed_filepath, model_type=1):
 else:
     # 第一步:收集需要调用 `continue_answer` 的问题和解析结果
     questions_to_continue = []  # 存储需要调用 continue_answer 的 (question, parsed)
-    max_tokens = 3900 if model_type == 1 else 5900
+    max_tokens = 8100 if model_type == 1 else 5900 #plus的max_tokens为8192 qianwen-long为6000,这里稍微取小一点,如果换doubao只有4000!!!
     for question, response in results:
         message = response[0]
         parsed = clean_json_string(message)
```
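The `max_tokens` bump matters because the code treats a reply that exhausts the completion budget as truncated and queues it for `continue_answer`. A sketch of that gate; the per-model limits mirror the commit's own comment (qwen-plus 8192, qianwen-long 6000, doubao 4000) and are assumptions here, not authoritative numbers:

```python
# assumed per-model completion budgets, set slightly below the hard caps
MAX_TOKENS_BY_MODEL = {1: 8100, 2: 5900}

def needs_continuation(completion_tokens: int, model_type: int) -> bool:
    # a reply at (or past) the budget is presumed cut off mid-JSON
    return completion_tokens >= MAX_TOKENS_BY_MODEL.get(model_type, 3900)

# collect questions whose replies look truncated, like questions_to_continue
usage = [("q1", 8150), ("q2", 1200)]
queued = [q for q, used in usage if needs_continuation(used, 1)]
```

Setting the threshold a little under the true cap trades a few unnecessary continuations for never silently accepting a clipped answer.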
```diff
@@ -674,7 +674,7 @@ def get_technical_requirements(invalid_path, processed_filepath, model_type=1):
     # 更新原始采购需求字典
     final_res = combine_and_update_results(modified_data, temp_final)
     ffinal_res = main_postprocess(final_res)
-    ffinal_res["货物列表"] = good_list
+    ffinal_res["货物列表"] = good_list #这里会将需要采购的货物列表带出来
     # 输出最终的 JSON 字符串
     return {"采购需求": ffinal_res}
```
```diff
@@ -12,8 +12,8 @@ def fetch_procurement_reqs(procurement_path, invalid_path):
 #procurement_path可能是pdf\docx
 # 定义默认的 procurement_reqs 字典
 DEFAULT_PROCUREMENT_REQS = {
-    "采购需求": {},
-    "技术要求": [],
+    "采购需求": {}, #对具体的货物采购要求 技术参数
+    "技术要求": [], #对供应商的技术、商务 服务要求 ,而不是对具体货物
     "商务要求": [],
     "服务要求": [],
     "其他要求": []
```
```diff
@@ -46,9 +46,9 @@ def fetch_procurement_reqs(procurement_path, invalid_path):
 # 使用 ThreadPoolExecutor 并行处理 get_technical_requirements 和 get_business_requirements
 with concurrent.futures.ThreadPoolExecutor() as executor:
     # 提交任务给线程池
-    future_technical = executor.submit(get_technical_requirements, invalid_path, processed_filepath, tech_model_type)
+    future_technical = executor.submit(get_technical_requirements, invalid_path, processed_filepath, tech_model_type) #采购需求
     time.sleep(0.5) # 保持原有的延时
-    future_business = executor.submit(get_business_requirements, procurement_path, processed_filepath, busi_model_type)
+    future_business = executor.submit(get_business_requirements, procurement_path, processed_filepath, busi_model_type) #技术、商务、服务、其他要求
     # 获取并行任务的结果
     technical_requirements = future_technical.result()
     business_requirements = future_business.result()
```
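The executor hunk above overlaps the two slow extraction calls instead of running them back to back. The same submit-then-result shape in isolation, with stub functions standing in for the real LLM-backed extractors:

```python
import concurrent.futures
import time

def technical_stub():
    # stand-in for get_technical_requirements (per-item specs)
    time.sleep(0.05)
    return {"采购需求": {}}

def business_stub():
    # stand-in for get_business_requirements (supplier-level requirements)
    return {"技术要求": [], "商务要求": []}

with concurrent.futures.ThreadPoolExecutor() as executor:
    future_technical = executor.submit(technical_stub)
    future_business = executor.submit(business_stub)
    # .result() blocks, but both tasks are already running concurrently
    technical = future_technical.result()
    business = future_business.result()

merged = {**technical, **business}
```

`.result()` also re-raises any exception from the worker, so failures surface at the call site rather than disappearing inside the pool.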
Binary files added (screenshots referenced by the README): md_files/0.png, 1.png, 2.png, 3.png, 4.png, 5.png, 6.png, 7.png, 8.png, 9.png, 10.png, 11.png, 12.png, 13.png, 14.png, 16.png, 17.png.