How to fix Athena failing to read multi-line text in a CSV field
This Athena table reads the first line of the file correctly.
CREATE EXTERNAL TABLE `test_delete_email5`(
  `col1` string, `col2` string, `col3` string, `col4` string, `col5` string,
  `col6` string, `col7` string, `col8` string, `col9` string, `col10` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'LINES TERMINATED BY' = '\n',
  'ESCAPED BY' = '\\',
  'quoteChar' = '\"'
) LOCATION 's3://testme162/email_backup/email5/'
TBLPROPERTIES ('has_encrypted_data'='false')
The table is not imported correctly because of the HTML code found in the 5th column. Is there any other way to do this?
Answer
Your file appears to contain a lot of multi-line text in the textbody field. That does not fit the CSV layout Athena expects (or at least, OpenCSVSerde cannot parse it).
As a test, I made a simple file:
"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"
"one","two","three","four","five","six","seven","eight","nine","ten"
"one","two","three","four","five \" quote \" five2","six","seven","eight","nine","ten"
"one","two","three","four","five \
five2","six","seven","eight","nine","ten"
- Row 1 is the header
- Row 2 is normal
- Row 3 has a field containing escaped quotes
- Row 4 has an escaped newline
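I then ran the command from your question and pointed it at this data file.
Result:
- Rows 1-3 were returned (including the header row)
- Row 4 only worked up to the backslash (\); the data after it was lost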
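The result above is what you would expect from a reader that treats every physical line as one row. The following is a minimal Python sketch of that mismatch, not a reproduction of Athena's internals: it assumes a sample that mirrors the test file, and it uses Python's csv module (a different parser from OpenCSVSerde) only to show how many logical records the data contains. The names SAMPLE, physical_lines, and logical_records are made up for this illustration.

```python
import csv
import io

# Assumed sample mirroring the test file: row 4 has a backslash-escaped newline.
LINES = [
    '"newsletterid","name","format","subject","textbody","htmlbody","createdate","active","archive","ownerid"',
    '"one","two","three","four","five","six","seven","eight","nine","ten"',
    '"one","two","three","four","five \\" quote \\" five2","six","seven","eight","nine","ten"',
    '"one","two","three","four","five \\',
    'five2","six","seven","eight","nine","ten"',
]
SAMPLE = "\n".join(LINES) + "\n"


def physical_lines(text):
    """Split on line breaks only, the way a line-oriented reader sees the file."""
    return text.splitlines()


def logical_records(text):
    """Parse with a quote-aware and escape-aware CSV reader (Python's, not Athena's)."""
    reader = csv.reader(
        io.StringIO(text, newline=""),
        quotechar='"',
        escapechar="\\",
        doublequote=False,
    )
    return list(reader)


if __name__ == "__main__":
    print("physical lines :", len(physical_lines(SAMPLE)))
    print("logical records:", len(logical_records(SAMPLE)))
```

The two counts differ because the fourth record spans two physical lines. A reader that splits on newlines first cuts that record in half before any quote or escape handling can apply.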
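Bottom line: your file format is not compatible with CSV as Athena reads it.
You might be able to find a Serde that can handle it, but OpenCSVSerde does not seem to understand it, because rows are normally split on newlines.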
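If no suitable Serde turns up, one possible workaround (not part of the test above) is to rewrite the file before it is uploaded to S3, so that every record sits on a single physical line. The sketch below is a hedged example under two assumptions: the embedded line breaks can be replaced with a space without losing meaning, and the file is small enough to process in memory. The function name flatten_newlines and its replacement parameter are made up for this illustration.

```python
import csv
import io


def flatten_newlines(src_text, replacement=" "):
    """Return CSV text in which line breaks inside fields are replaced.

    Assumes backslash-escaped, double-quoted input like the test file.
    """
    reader = csv.reader(
        io.StringIO(src_text, newline=""),
        quotechar='"',
        escapechar="\\",
        doublequote=False,
    )
    out = io.StringIO(newline="")
    writer = csv.writer(
        out,
        quoting=csv.QUOTE_ALL,
        quotechar='"',
        escapechar="\\",
        doublequote=False,
        lineterminator="\n",
    )
    for row in reader:
        cleaned = [
            field.replace("\r\n", replacement)
            .replace("\n", replacement)
            .replace("\r", replacement)
            for field in row
        ]
        writer.writerow(cleaned)
    return out.getvalue()


if __name__ == "__main__":
    raw = '"a","b"\n"x","line1 \\\nline2"\n'
    print(flatten_newlines(raw), end="")
```

After this step each record occupies exactly one line, which is the layout a newline-splitting reader needs. Whether the HTML in the 5th column survives the replacement in a usable form depends on your data, so check a sample of the output before loading it.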