Reading PDS data types and PDS4

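These are problems I ran into recently while processing lunar and Martian data. All data shown in this post is made up and only serves as illustration. This is also my first time handling this kind of data, so there are bound to be mistakes; if you know a better approach, I would be glad to discuss it.

One more thing to note!!! Only files whose extension contains an L, such as .BL, can be opened with the pds4_tools library. Opening the .BL file loads the corresponding .B file, so do not open the .B file directly!!!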

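Error summary

First of all, an error usually means the data itself is flawed and should be cleaned. But cleaning data in bulk is quite a lot of trouble, and new errors keep turning up, so the rest of this post works around the problems by patching the library directly.
(This mainly covers reading, displaying, analyzing, and inspecting labels; for real downstream processing, working around format problems like this is generally of little value.)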

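1. Group length not evenly divisible
pds4_tools.utils.exceptions.PDS4StandardsException: Group length '999' must be evenly divisible by the number of repetitions '100' for group 'HHHHHH' (full location: 'HHHHHH')
This means that group HHHHHH is repeated, but its length of 999 cannot be split evenly across 100 repetitions. To fix the data instead, find the HHHHHH group in the L (label) file and change 999 to 1000.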
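
To find such groups before pds4_tools trips over them, the label can be scanned directly. The following is a minimal sketch using only the standard library. It assumes the label stores each group as an element with `group_length` and `repetitions` children; the helper name `find_bad_groups` and the sample label are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up label fragment for illustration. Real labels carry an XML namespace,
# which local_name() strips so the lookup works either way.
SAMPLE_LABEL = """
<Record_Character>
  <Group_Field_Character>
    <name>HHHHHH</name>
    <repetitions>100</repetitions>
    <group_length>999</group_length>
  </Group_Field_Character>
  <Group_Field_Character>
    <name>GGGGGG</name>
    <repetitions>10</repetitions>
    <group_length>50</group_length>
  </Group_Field_Character>
</Record_Character>
"""


def local_name(tag):
    """Return the tag name without a '{namespace}' prefix."""
    return tag.rsplit('}', 1)[-1]


def find_bad_groups(label_text):
    """Return (name, length, repetitions) for groups whose length is not divisible."""
    bad = []
    for elem in ET.fromstring(label_text).iter():
        # Collect the direct children of this element as {tag: text}.
        children = {local_name(c.tag): (c.text or '').strip() for c in elem}
        if 'group_length' in children and 'repetitions' in children:
            length = int(children['group_length'])
            reps = int(children['repetitions'])
            if length % reps != 0:
                bad.append((children.get('name', '?'), length, reps))
    return bad


bad_groups = find_bad_groups(SAMPLE_LABEL)
print(bad_groups)
```

Running it over every label in a directory gives a list of groups to correct by hand, which may be enough when only a few files are affected.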

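2. Empty values
ValueError: Unable to convert field 'HHHHHHH' to data_type 'ASCII_Integer': "invalid literal for int() with base 10: b''"
This means an empty value cannot be converted to an integer. To fix the data instead, find every empty value and fill in something like 9999 or 0.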

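3. Format problems
ValueError: Unable to convert field 'HHHHHHH' to data_type 'ASCII_Integer': "invalid literal for int() with base 10: b'NUL'"
Here the data was padded with NUL wherever there was no integer value, which conflicts with the declared data type, so the data type has to change.

ValueError: Unable to convert field 'HHHHHHH' to data_type 'ASCII_Integer': "invalid literal for int() with base 10: b' 11.111'"
In this case the data simply does not match the declared type.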
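
All three ValueError messages above come from Python's built-in `int()`, so they can be reproduced without any PDS data. The sketch below is only an illustration; `try_ascii_integer` is a hypothetical helper, not a pds4_tools function.

```python
def try_ascii_integer(raw, base=10):
    """Attempt an integer conversion; return (value, error_message)."""
    try:
        return int(raw, base), None
    except ValueError as exc:
        return None, str(exc)


# One valid field followed by the three problem cases from the error summary.
samples = [b'42', b'', b'NUL', b' 11.111']
results = {raw: try_ascii_integer(raw) for raw in samples}

for raw, (value, error) in results.items():
    print(repr(raw), '->', value if error is None else 'ValueError: ' + error)
```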

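Background

PDS4 is a data storage standard developed by NASA, mostly used to archive planetary data. Its level classification is shown in the table below [from reference 5]:
Data usually comes as a pair of .B and .BL files, that is, the data is stored in two parts: a header and the data file itself. The header (the L file) is written in XML and contains the storage format of the data and the path to the data file.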
```
<!-- An example XML file -->
<note>
<to>George</to>
<from>John</from>
<heading>Reminder</heading>
<body>Don't forget the meeting!</body>
</note>
```
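The L file is a label file that describes the data. As in any XML document, the tag names are not predefined and can be defined freely.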
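
Because the label is plain XML, the name of the paired data file can be read with the standard library alone. This is a minimal sketch, assuming the label records it in a `file_name` element; the sample label and the helper `data_file_names` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up, heavily shortened label for illustration only.
SAMPLE_LABEL = """
<Product_Observational xmlns="http://pds.nasa.gov/pds4/pds/v1">
  <File_Area_Observational>
    <File>
      <file_name>EXAMPLE_DATA.B</file_name>
    </File>
  </File_Area_Observational>
</Product_Observational>
"""


def data_file_names(label_text):
    """Return the text of every file_name element, ignoring XML namespaces."""
    root = ET.fromstring(label_text)
    return [elem.text.strip() for elem in root.iter()
            if elem.tag.rsplit('}', 1)[-1] == 'file_name' and elem.text]


names = data_file_names(SAMPLE_LABEL)
print(names)
```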

PDS4 data can currently be analyzed with pds4_tools, using the following code:

    import pds4_tools
    path2clq = 'a/path/to/file.BL'  # placeholder: path to the label file
    pds_struct = pds4_tools.read(path2clq)
    pds4_tools.view(from_existing_structures=pds_struct)

This brings up a window like the one shown below [from reference 2]:

Solution

First, to keep things stable, rename the Python package to pds4_tools_HHH.

Edit the first line of __init__.py. After this, the command pds4_read(path2clq, quiet=True) already works. Change

    from pds4_tools.__about__ import (__version__, __author__, __email__, __copyright__)

to

    from pds4_tools_HHH.__about__ import (__version__, __author__, __email__, __copyright__)

Edit line 268 of viewer/cache.py. After this, .view should work as well. Change

    return sys.modules['pds4_tools'].__version__

to

    return sys.modules['pds4_tools_HHH'].__version__
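
The two references above were located by hand. As a rough sketch of how any remaining hard-coded uses of the old package name could be found, the following walks a directory and reports lines that still mention `pds4_tools` as a whole word. The helper `find_old_references` is hypothetical, and the demo runs on a throwaway directory rather than the real package; hits inside comments or docstrings are harmless and can be ignored.

```python
import os
import re
import tempfile


def find_old_references(package_dir, old_name='pds4_tools'):
    """Return (relative_path, line_number, line) for lines naming old_name exactly."""
    # \b keeps 'pds4_tools_HHH' from matching, since '_' is a word character.
    pattern = re.compile(r'\b' + re.escape(old_name) + r'\b')
    hits = []
    for folder, _subdirs, files in os.walk(package_dir):
        for file_name in sorted(files):
            if not file_name.endswith('.py'):
                continue
            path = os.path.join(folder, file_name)
            with open(path, encoding='utf-8', errors='replace') as handle:
                for line_number, line in enumerate(handle, start=1):
                    if pattern.search(line):
                        rel = os.path.relpath(path, package_dir)
                        hits.append((rel, line_number, line.strip()))
    return hits


# Demo on a temporary directory containing one stale and one updated import.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, '__init__.py'), 'w', encoding='utf-8') as handle:
        handle.write("from pds4_tools.__about__ import __version__\n")
        handle.write("from pds4_tools_HHH.reader import pds4_read\n")
    demo_hits = find_old_references(demo_dir)

for hit in demo_hits:
    print(hit)
```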

Error 1: group length + 1

Edit the check at line 1364 of reader/table_objects.py so that it also tries adding one to the length. Change

                    if item['length'] % item['repetitions'] != 0:
                        raise PDS4StandardsException("Group length '{0}' must be evenly divisible by the "
                                                     "number of repetitions '{1}' for group {2}"
                                                     .format(item['length'], item['repetitions'],
                                                             full_location_warn))

to

                    if item['length'] % item['repetitions'] != 0:
                        # Workaround: accept a length that is short by exactly one
                        if (item['length'] + 1) % item['repetitions'] != 0:
                            raise PDS4StandardsException("Group length '{0}' must be evenly divisible by the "
                                                         "number of repetitions '{1}' for group {2}"
                                                         .format(item['length'], item['repetitions'],
                                                                 full_location_warn))
                        else:
                            item['length'] += 1
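
The logic of this patch can be tried out in isolation. The function below is a standalone sketch of the same rule (accept the length as is, or bump it by one, or fail); the name `tolerant_group_length` is made up, and it raises a plain ValueError instead of the library's PDS4StandardsException.

```python
def tolerant_group_length(length, repetitions):
    """Return a length divisible by repetitions, tolerating an off-by-one shortfall."""
    if length % repetitions == 0:
        return length
    if (length + 1) % repetitions == 0:
        # Same workaround as the patch: assume the label is short by exactly one.
        return length + 1
    raise ValueError(
        "Group length '{0}' must be evenly divisible by the number of "
        "repetitions '{1}'".format(length, repetitions))


adjusted = tolerant_group_length(999, 100)
print(adjusted)
```

Note that the workaround only covers a shortfall of exactly one; any other mismatch still raises, so genuinely broken labels are not silently accepted.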

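Both the empty-value filling and the type problems originate in reader/data_types.py, inside the function def data_type_convert_table_ascii(data_type, data, mask_nulls=False, decode_strings=False).

For filling empty values, the function already provides the hook, namely mask_nulls, so it is enough to pass it where the function is called. In reader/read_tables.py, line 929, change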

```
kwargs = {'decode_strings': False}
```
to

    kwargs = {'decode_strings': False, 'mask_nulls': True}

The existing code below then fills in the empty values:

        for i, datum in enumerate(data):
            if datum.strip() == b'':
                mask_array[i] = True

        data[mask_array] = six.ensure_binary(str(fill_value))

This resolves the following error:

to data_type 'ASCII_Integer': "invalid literal for int() with base 10: b''"

You can also try changing the value of fill_value.
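
To see what the masking step does without touching the library, here is a plain-Python imitation of it. It is a sketch only: `fill_empty_fields` is a hypothetical helper working on a list of byte strings, whereas the library code operates on its own array and mask objects.

```python
def fill_empty_fields(data, fill_value=0):
    """Replace blank byte strings with fill_value; return (filled, mask)."""
    # A field counts as empty when nothing is left after stripping whitespace.
    mask = [datum.strip() == b'' for datum in data]
    replacement = str(fill_value).encode('ascii')
    filled = [replacement if is_empty else datum
              for datum, is_empty in zip(data, mask)]
    return filled, mask


raw_column = [b' 12', b'   ', b'7', b'']
filled, mask = fill_empty_fields(raw_column, fill_value=9999)
values = [int(item.decode('ascii')) for item in filled]

print(mask)
print(values)
```

Keeping the mask alongside the filled values makes it possible to tell real measurements from placeholders later, which matters if the fill value could also occur as genuine data.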

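The type problem mainly means a mismatch between the data and the declared type. The approach here is simply to go through a list of types one by one when the conversion fails, and only raise an error if none of them works.

First, let's look at the function np.issubdtype, which tells whether the first type is the same as, or lower in the type hierarchy than, the second. A quick test shows that it is easy to use for checking which category a type belongs to: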
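```
import numpy as np

type1 = np.int64  # the np.int alias was removed in NumPy 1.24
type2 = np.floating
print(np.issubdtype(type1, type2))  # False: integer types are not a subtype of floating
```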

Looking at the code, it clearly handles four categories separately: boolean, floating point, integer, and character. NumPy provides 24 types for describing scalars:

    # Defined in numpy/core/numerictypes.py
    generic = allTypes['generic']

    genericTypeRank = ['bool', 'int8', 'uint8', 'int16', 'uint16',
                       'int32', 'uint32', 'int64', 'uint64', 'int128',
                       'uint128', 'float16',
                       'float32', 'float64', 'float80', 'float96', 'float128',
                       'float256',
                       'complex32', 'complex64', 'complex128', 'complex160',
                       'complex192', 'complex256', 'complex512', 'object']

Ordering the types by how much information they can hold, we turn the check into a one-way chain of bool, int, float, str. Change line 508 from

```
data[i] = int(data[i], numeric_base)
```
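to

                try:
                    data[i] = int(data[i], numeric_base)
                except ValueError:
                    # Not an integer: fall back to float, then to a string
                    try:
                        data[i] = float(data[i])
                        dtype = np.floating
                    except ValueError:
                        # Note: str() on a bytes value yields its repr, e.g. "b'NUL'"
                        data[i] = str(data[i])
                        dtype = np.character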

After these changes, all of my data could be read without errors.
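
The int, float, str fallback order can also be checked outside the library. The function below is a minimal sketch under the assumption that each field arrives as an ASCII byte string; `convert_with_fallback` is a hypothetical name, and unlike the patch it decodes the bytes first so the final fallback is the plain text rather than a bytes repr.

```python
def convert_with_fallback(raw, base=10):
    """Convert an ASCII field to int if possible, else float, else str."""
    text = raw.decode('ascii', errors='replace').strip()
    try:
        return int(text, base)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        # Neither number form fits: keep the text unchanged.
        return text


column = [b'12', b' 11.111', b'NUL']
converted = [convert_with_fallback(item) for item in column]
print(converted)
```

One consequence of this design is that a single column can end up holding values of different Python types, so any later numeric processing has to check the type of each entry first.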

References

[1] https://www.w3school.com.cn/xml/xml_intro.asp
[2] https://www.zhihu.com/question/48057070
[3] Scalars | NumPy Chinese documentation
[4] "48 python, numpy, pandas: converting between data structures and data types (summary) (tcy)", tcy-阿春, CSDN blog
[5] https://zhuanlan.zhihu.com/p/106395591
