Parsing a Tick (分笔) File

Created: 22 January 2025

Answered using O1 by Chat01

Question

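I have downloaded this tick file: I:\DZH\国信\datadir\SZ\0\300420\20250122.dat
Please read this file using the code from this link: https://zhuanlan.zhihu.com/p/534110919
Each record should be 144 bytes.
timetag: date and time
lastPrice: latest price
open: opening price
high: highest price
low: lowest price
lastClose: previous close
amount: turnover
volume: volume (in lots)
pvolume: volume (in shares)
stockStatus: stock status (deprecated)
openInt: stock status for stocks; open interest for non-stock instruments
settlementPrice: settlement price
lastSettlementPrice: latest settlement price
askPrice: list, five levels of ask prices
bidPrice: list, five levels of bid prices
askVol: list, five levels of ask volumes
bidVol: list, five levels of bid volumes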

Thinking

To help the user, I reviewed the structure of the tick (FenBi) file they downloaded. Each record is 144 bytes and includes fields such as date/time, latest price, opening price, highest price, lowest price, previous close, turnover, and volume.

Clarifying data fields

The file includes fields like total volume, stock status, settlement price, bid/ask prices, and bid/ask volumes for different levels. Each row is 144 bytes, likely representing data for a tick or date-time bar.

Extracting FenBi data

I'm piecing together details from a Zhihu post, focusing on parsing 144-byte records using Python's struct, considering offsets and data types like int and float.

Assigning field sizes

I'm working through each field's size and type, guessing 4 bytes for integers and floats, with additional 5 levels for bid/ask prices and volumes. Progress is steady.

Breaking down the bytes

Piecing together each field's byte allocation, including integers, floats, and 5 levels of bid/ask prices and volumes, to ensure accurate parsing of the 144-byte record.

Mapping the parsing process

I’m detailing reading the file in binary mode and using struct to parse 144-byte records, considering the field types and constructing an appropriate format string for data extraction.

Breaking down fields

OK, let me see. The fields range from 4-byte integers to 8-byte doubles, like timetag, lastPrice, open, high, low, lastClose, amount, volume, pvolume, and stockStatus.

Deciphering data fields

OK, I’m thinking through the inclusion of stockStatus as discard and considering openInt as a 4-byte integer. This helps align the data fields with their specific formats.

Pulling together data

I’m defining the remaining fields like settlementPrice, askPrice, bidPrice, askVol, and bidVol, ensuring they fit into the overall 144-byte record structure.

Calculating data structure

I'm adding up the byte sizes for each field, including 6 four-byte fields, additional float fields, and 5 int fields. This ensures the total structure aligns with the 144-byte row size.

Crunching byte totals

Breaking down the byte totals for each data type, including integers, floats, and doubles, to ensure they align with the 144-byte record structure.

Piecing together the structure

I'm working through the Python struct format string, ensuring it correctly aligns with 144 bytes, and carefully mapping data types to their corresponding byte sizes.

Ensuring precision

I’m aligning the fields to their byte sizes, including arrays and single fields, to ensure the struct format string accurately reflects 144 bytes. Progress is steady.

Mapping field sizes

I’m analyzing each field's size and type, ranging from 4-byte integers to 8-byte doubles, ensuring they align perfectly to fit into the 144-byte record structure.

Fine-tuning the format

I’m adjusting the Python struct format string to ensure it accurately captures all fields, including 5 floats, one double, and multiple integers, to fit the 144-byte record structure.

Verifying data structure

I’m checking that the Python struct format string includes 7 single floats, 10 floats in arrays, and several integers to match the 144-byte record structure.

Confirming byte accuracy

I’m verifying byte allocation for each data type, including floats, doubles, and integers, to ensure they align with the expected 144-byte record structure.

Reassessing field sizes

Counting up the fields, it seems we might be short by 8 bytes. I’m curious if timetag or volume should be 8 bytes for a more precise match.

Evaluating field sizes

I'm re-evaluating field sizes, considering 8-byte timetag and pvolume, and assessing currency and stockStatus fields for a precise 144-byte record structure.

Pinpointing byte allocations

Mapping out each field's byte allocation, including an 8-byte timetag, 4-byte integers, and groups of five 4-byte floats. This confirms a need for precise struct format adjustments.

Adjusting allocations

I’m mapping out byte allocations for each field, including 8-byte timetag, 4-byte integers, and grouped floats/ints. This reveals a 4-byte budget shortfall, suggesting a need for adjustment.

Revisiting byte needs

I'm considering whether "amount" should be 4 bytes to rectify the 4-byte shortfall or if "volume" could be adjusted to 8 bytes for a more accurate match.

Refining byte allocation

I’m re-thinking the structure to balance byte sizes, considering potentially adjusting fields like volume or amount to match the 144-byte record. Progress is steady.

Reevaluating time fields

I’m considering timetag as an integer for precise date/time measurement. It’s interesting to think about how different methods could align the field with the 144-byte record limit.

Mapping the structure

I’m curious about the DZH FenBi format structure to ensure each field aligns with the 144-byte record size. This involves looking up references or consulting the provided link for details.

Breaking down the structure

I’m analyzing the MarketStock struct in C#, calculating each field's size to ensure the total fits the 144-byte record structure, including 4-byte integers, 8-byte doubles, and groups of five 4-byte floats.

Pinpointing byte alignment

I’m analyzing padding in C# structs. Assuming 144-byte alignment, I’m reflecting on Python code to read the file, addressing potential padding or extra fields.

Reading and parsing

I’m crafting a Python snippet to read and parse 144-byte records, aligning field sizes and types. This ensures accurate data interpretation from the file.

Breaking down Python's struct

I’m analyzing the struct.unpack format string to ensure compatibility with the fields and types in 144-byte records, confirming the number of expected items per data type.

Ensuring precision

I’m piecing together the format string for struct.unpack, confirming 33 items for 144-byte records: 1 for I, 5 for f, 1 for d, 4 for i, 2 for f, and four groups of floats and integers.

Ensuring alignment

OK, let me see. Confirming the struct.unpack format string matches field sizes and types, with 33 items for 144-byte records, I’m ensuring accurate data parsing and alignment.

Reassessing byte allocation

I’m examining the byte allocation for the fields in the Python format string, noting a shortfall of 8 bytes relative to the 144-byte record. This suggests re-checking the array fields and possible padding.

Offering Python code

I’m piecing together a Python snippet to read 144-byte records, balancing byte allocation accuracy and simplicity, aligned with the user's preference for a clear and efficient solution.

Thought about parsing financial data for 3m 21s

Answer

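Below is a sample Python script showing how to parse each 144-byte record according to the common "DZH (大智慧) tick format" and read its fields one by one. The structure corresponds to the field list you provided (timetag, lastPrice, open, high, low, lastClose, amount, volume, pvolume, stockStatus, openInt, settlementPrice, lastSettlementPrice, askPrice[5], bidPrice[5], askVol[5], bidVol[5]).

Note: in the tick files written by some DZH and broker-branded software, the record may be 136 or 144 bytes depending on alignment or version history. Reserved fields may also have been added or removed, so a record that is nominally 144 bytes may contain 8 bytes of padding that the parser has to account for. If the format string below does not unpack the data correctly in one go, alignment or a version difference is the usual cause. The example below assumes the most common 144-byte little-endian layout.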
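Before parsing any fields, one quick way to tell a 136-byte layout from a 144-byte one is to see which record size divides the file size evenly. The sketch below assumes the file has no header and holds only fixed-size records; `candidate_record_sizes` is a hypothetical helper name introduced here for illustration.

```python
import os


def candidate_record_sizes(file_size, sizes=(136, 144)):
    """Return the record sizes from `sizes` that divide file_size evenly."""
    if file_size <= 0:
        return []
    return [size for size in sizes if file_size % size == 0]


# Usage with the path from the question (uncomment on the machine that has the file):
# path = r"I:\DZH\国信\datadir\SZ\0\300420\20250122.dat"
# print(candidate_record_sizes(os.path.getsize(path)))
```

If both sizes divide the file size, the check is inconclusive and the field values themselves have to decide.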


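I. Field layout and inferred sizes

Based on the field names you gave and the common DZH tick format, the layout can be inferred as follows (little-endian, as is usual for C/C++/C# on x86):

  1. timetag: uint32 (4 bytes)
  2. lastPrice: float (4 bytes)
  3. open: float (4 bytes)
  4. high: float (4 bytes)
  5. low: float (4 bytes)
  6. lastClose: float (4 bytes)
  7. amount: double (8 bytes); turnover is usually stored as an 8-byte float
  8. volume: int32 (4 bytes)
  9. pvolume: int32 (4 bytes)
  10. stockStatus: int32 (4 bytes); you noted it is deprecated, but the file usually still keeps this slot
  11. openInt: int32 (4 bytes)
  12. settlementPrice: float (4 bytes)
  13. lastSettlementPrice: float (4 bytes)
  14. askPrice: float[5] (5×4 = 20 bytes)
  15. bidPrice: float[5] (5×4 = 20 bytes)
  16. askVol: int32[5] (5×4 = 20 bytes)
  17. bidVol: int32[5] (5×4 = 20 bytes)

Concatenated without any extra alignment, these fields total 136 bytes: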

  4 (timetag) + 5×4 (lastPrice, open, high, low, lastClose) + 8 (amount)
  + 4 (volume) + 4 (pvolume) + 4 (stockStatus) + 4 (openInt)
  + 4 (settlementPrice) + 4 (lastSettlementPrice)
  + 5×4 (askPrice) + 5×4 (bidPrice) + 5×4 (askVol) + 5×4 (bidVol)
  = 4 + 20 + 8 + 16 + 8 + 20 + 20 + 20 + 20 = 136
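The same arithmetic can be checked with `struct.calcsize`. This is a minimal sketch that assumes the inferred types above and a packed little-endian layout (the `<` prefix disables alignment); the `FIELDS` table is only an illustration of that guess, not a confirmed specification of the file.

```python
import struct

# (field name, struct code) pairs for the inferred layout
FIELDS = [
    ("timetag", "I"), ("lastPrice", "f"), ("open", "f"), ("high", "f"),
    ("low", "f"), ("lastClose", "f"), ("amount", "d"), ("volume", "i"),
    ("pvolume", "i"), ("stockStatus", "i"), ("openInt", "i"),
    ("settlementPrice", "f"), ("lastSettlementPrice", "f"),
    ("askPrice", "5f"), ("bidPrice", "5f"), ("askVol", "5i"), ("bidVol", "5i"),
]

total = 0
for name, code in FIELDS:
    size = struct.calcsize("<" + code)
    print(f"{name:20s} offset={total:3d} size={size:2d}")
    total += size

print("total bytes:", total)
```

The printed offsets also show that `amount` starts at offset 24, which is useful when reasoning about alignment.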

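In practice, however, the records in the file are often 144 bytes, which leaves 8 bytes unaccounted for. Alignment alone does not explain them for this exact layout: amount already sits at offset 24, and 136 is a multiple of 8, so default alignment in C#/C++ would add no padding. The extra bytes more likely come from a reserved field or from a field that is wider than guessed here. For example, an 8-byte timetag followed by 4 bytes of padding before amount gives exactly 144. If the file really does use fixed 144-byte records, it is reasonable to assume each record is this structure plus 8 such bytes, but their position has to be confirmed against the data.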

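The Python code below starts from the format string for the most common layout (assuming no extra padding):

<I f f f f f d i i i i f f 5f 5f 5i 5i

This corresponds to 136 bytes. If you have confirmed that each record really is 144 bytes and this structure does not unpack it (for example, struct raises an error or bytes are left over), you can:

  • Append 8x to the end of the format string to skip the extra 8 bytes (padding or a reserved field). Note that the count comes before the x; x8 is not valid struct syntax.
  • Or insert reserved fields between some of the fields in the middle (possibly 4 bytes + 4 bytes, or 8 bytes).
  • Or the timetag may actually be 8 bytes rather than 4 (some versions use a long), in which case change <I to <Q and adjust the following fields accordingly.

In short, if the values do not line up, try these variants or consult more specific documentation or examples.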
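The alternatives above can be tried side by side. The sketch below lists three candidate format strings that each add up to 144 bytes and unpacks one record with every candidate so the results can be compared by eye. The candidate names, the `4x` pad after an 8-byte timetag, and the choice of widening both volume fields are assumptions made for illustration; none of them is confirmed by the file or the linked article.

```python
import struct

# Candidate 144-byte layouts (all little-endian, no implicit alignment)
CANDIDATES = {
    # 136 bytes of fields, then 8 bytes skipped at the end
    "tail_padding": "<I 5f d 4i 2f 5f 5f 5i 5i 8x",
    # 8-byte timetag, 4 pad bytes so that amount starts at offset 32
    "wide_timetag": "<q 5f 4x d 4i 2f 5f 5f 5i 5i",
    # volume and pvolume stored as 8-byte integers
    "wide_volumes": "<I 5f d 2q 2i 2f 5f 5f 5i 5i",
}


def try_layouts(record):
    """Unpack one raw record with every candidate whose size matches it."""
    results = {}
    for name, fmt in CANDIDATES.items():
        if struct.calcsize(fmt) == len(record):
            results[name] = struct.unpack(fmt, record)
    return results


# Usage: read the first 144 bytes of the file and compare the interpretations.
# with open(path, "rb") as f:
#     for name, values in try_layouts(f.read(144)).items():
#         print(name, values[:7])
```

Whichever candidate yields sensible prices and a sensible timetag for several consecutive records is the one worth keeping.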


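II. Python parsing example

Below is a fairly template-like piece of Python code that reads the binary file in a loop (144 bytes at a time) and uses struct to unpack and print the fields in the order you listed. Start from the most common layout, <I f f f f f d i i i i f f 5f 5f 5i 5i. Because struct.unpack requires the buffer length to match the format size exactly, a 144-byte record needs pad bytes added to that format; if the values still come out shifted, adjust the layout accordingly.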

python
import struct

# Size of each record in bytes. The file is stated to use 144-byte records.
record_size = 144

# --------------------------------------------------------------------
# Candidate layouts (little-endian):
# 1) Most common 136-byte layout, no extra padding:
#      '<I f f f f f d i i i i f f 5f 5f 5i 5i'
# 2) Same layout with 8 trailing bytes skipped, for 144-byte records:
#      '<I f f f f f d i i i i f f 5f 5f 5i 5i 8x'
# 3) If timetag is 8 bytes (e.g. long long), try:
#      '<Q f f f f f d i i i i f f 5f 5f 5i 5i'
#    and add pad bytes (x) where needed.
#
# Option 2 is used here, assuming 8 reserved bytes at the end ("8x").
# --------------------------------------------------------------------
fmt = '<I f f f f f d i i i i f f 5f 5f 5i 5i 8x'
# With "8x", unpacked fields + skipped bytes = 144 bytes
unpack_size = struct.calcsize(fmt)

# Open the tick data file in binary mode
with open(r"I:\DZH\国信\datadir\SZ\0\300420\20250122.dat", "rb") as f:
    while True:
        data = f.read(record_size)
        if len(data) < unpack_size:
            # EOF, or a truncated final record (damaged or incomplete file)
            break

        fields = struct.unpack(fmt, data)

        # Map the unpacked values to the named fields, in order
        timetag = fields[0]
        lastPrice = fields[1]
        openPrice = fields[2]
        highPrice = fields[3]
        lowPrice = fields[4]
        lastClose = fields[5]
        amount = fields[6]
        volume = fields[7]
        pvolume = fields[8]
        stockStatus = fields[9]
        openInt = fields[10]
        settlementPrice = fields[11]
        lastSettlementPrice = fields[12]
        askPrice = fields[13:18]  # 5 floats
        bidPrice = fields[18:23]  # the next 5 floats
        askVol = fields[23:28]    # 5 ints
        bidVol = fields[28:33]    # 5 ints

        # Process or print the record as needed
        print(f"TimeTag={timetag}, Last={lastPrice}, Open={openPrice}, "
              f"High={highPrice}, Low={lowPrice}, LastClose={lastClose}, "
              f"Amount={amount}, Volume={volume}, pVolume={pvolume}, "
              f"Status={stockStatus}, OpenInt={openInt}, "
              f"Settle={settlementPrice}, PreSettle={lastSettlementPrice}, "
              f"AskPrice={askPrice}, BidPrice={bidPrice}, "
              f"AskVol={askVol}, BidVol={bidVol}")

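The code above makes the following assumptions:

  1. The basic fields (timetag, lastPrice, open, high, low, lastClose, amount, volume, pvolume, stockStatus, openInt, settlementPrice, lastSettlementPrice, askPrice[5], bidPrice[5], askVol[5], bidVol[5]) together occupy 136 bytes.
  2. Each record in the file actually occupies 144 bytes, and the extra 8 bytes (alignment or reserved) are skipped with 8x. This lets a single struct.unpack call unpack the record and step over those 8 bytes automatically.

If the data does not line up (for example, the timestamps look odd or the prices are wrong), remove or add pad fields (x) to match the real layout, or change the 4-byte timetag to 8 bytes (<Q) and check again. Some older versions, or .dat files exported by other software, may also have other reserved fields inserted in the middle.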
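Checking whether the data lines up can be made mechanical with a rough plausibility test on the price fields. The sketch below is a heuristic, not a validator: the bounds are arbitrary assumptions, `looks_plausible` is a hypothetical helper, and records from before the open (where open, high and low may still be zero) would be rejected even if the layout is right.

```python
import math


def looks_plausible(last, open_, high, low, last_close):
    """Rough sanity check for the price fields of one tick record."""
    prices = (last, open_, high, low, last_close)
    # A misaligned read typically yields NaN, tiny denormals or huge values.
    if not all(math.isfinite(p) and 0 < p < 1e6 for p in prices):
        return False
    # Within a session the last and open prices lie between low and high.
    return low <= last <= high and low <= open_ <= high


# Usage with the tuple returned by struct.unpack:
# ok = looks_plausible(*fields[1:6])
```

Running the check over the first few hundred records for each candidate layout gives a quick signal of which layout is correct.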


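III. Points that may vary

  1. Size of timetag
    In some versions timetag occupies 4 bytes (UInt32) and packs "HHmmss" or the number of seconds or milliseconds since that day's open. Others store the date and the time as two separate 4-byte fields, and others simply use 8 bytes (long long). This needs to be established first.

  2. Alignment padding
    Some compilers and languages align structs that contain a double by default and may add tail padding, which can make a struct larger than the sum of its fields. For the exact 17-field layout above this does not happen, because every field is already naturally aligned and 136 is a multiple of 8. A 144-byte record therefore suggests either a wider field plus padding or reserved bytes. With Pack=1 and a 4-byte timetag the size is 136.

  3. Purpose of the stockStatus field
    You noted it is deprecated, but the file usually still reserves its 4 bytes; the parsed value simply carries no real meaning.

  4. Depth of askPrice / askVol and the other level fields
    Some older versions or futures products may not have 5 levels (they may have 3 or 10), or may have no level data at all, which changes the byte layout.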
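For the first point, it can help to print a raw timetag under each of the interpretations mentioned and see which one gives a sensible trading time. The sketch below is only an aid for eyeballing values: the UTC+8 time zone, the thresholds used to separate seconds from milliseconds, and the helper name `interpret_timetag` are all assumptions, and it expects a non-negative integer.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumed exchange time zone (UTC+8)


def interpret_timetag(value):
    """Return the readings of a raw timetag that pass basic range checks."""
    readings = {}
    # Packed HHMMSS, e.g. 93000 -> 09:30:00
    hours, rest = divmod(value, 10000)
    minutes, seconds = divmod(rest, 100)
    if hours < 24 and minutes < 60 and seconds < 60:
        readings["hhmmss"] = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
    # Unix epoch seconds (fits in 32 bits, later than 2001)
    if 10**9 <= value < 2**32:
        readings["epoch_s"] = datetime.fromtimestamp(value, CST).isoformat()
    # Unix epoch milliseconds (needs an 8-byte field)
    if value >= 10**11:
        readings["epoch_ms"] = datetime.fromtimestamp(value / 1000, CST).isoformat()
    return readings


print(interpret_timetag(93000))
print(interpret_timetag(1737509400))
print(interpret_timetag(1737509400000))
```

A value that only makes sense as epoch milliseconds is a strong hint that timetag is an 8-byte field.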


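Conclusion

  • The Python code above demonstrates the most common approach to parsing this structure.
  • If you have confirmed that each record is 144 bytes, first try matching the data with the format string that ends in 8x. If the parsed values are clearly shifted, adjust the middle or the start of the structure (for example, change timetag to 8 bytes with <Q, or change volume/pvolume to 8 bytes).
  • Once parsing is correct, you can run calculations on each record, write the records to a database, or convert them to CSV.

This should largely cover the request to parse the DZH tick file following the code in the linked article. Good luck with debugging!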
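As an illustration of the CSV step, the sketch below converts a buffer of raw records into CSV rows. It assumes the 136-byte field layout plus 8 trailing bytes (the `8x` variant), which still has to be confirmed against the real file; the column names for the five levels (`askPrice1` and so on) and the helper `records_to_csv` are naming choices made here. The demo at the end uses a synthetic record with made-up values, not real market data.

```python
import csv
import io
import struct

# Assumed layout: 136 bytes of fields plus 8 trailing bytes skipped (144 total)
FMT = "<I 5f d 4i 2f 5f 5f 5i 5i 8x"

HEADER = [
    "timetag", "lastPrice", "open", "high", "low", "lastClose", "amount",
    "volume", "pvolume", "stockStatus", "openInt",
    "settlementPrice", "lastSettlementPrice",
]
for prefix in ("askPrice", "bidPrice", "askVol", "bidVol"):
    HEADER += [f"{prefix}{level}" for level in range(1, 6)]


def records_to_csv(raw, out):
    """Write every complete record in raw to out as CSV; return the row count."""
    rec = struct.Struct(FMT)
    writer = csv.writer(out)
    writer.writerow(HEADER)
    usable = len(raw) - len(raw) % rec.size  # ignore a truncated tail
    count = 0
    for fields in rec.iter_unpack(raw[:usable]):
        writer.writerow(fields)
        count += 1
    return count


# Round-trip demo on one synthetic record
sample = struct.pack(FMT, 93000, 10.5, 10.0, 11.0, 9.5, 10.25, 123456.0,
                     100, 10000, 0, 0, 0.0, 0.0,
                     *[10.5] * 5, *[10.25] * 5, *[1] * 5, *[2] * 5)
buffer = io.StringIO()
print(records_to_csv(sample, buffer))
print(buffer.getvalue().splitlines()[0])
```

For a real file, pass the bytes read from the .dat file as `raw` and a file opened with `newline=""` as `out`.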
