importing thousands of text files - unknown encodings


importing thousands of text files - unknown encodings

Postby shawnkhall » Wed Nov 27, 2013 11:14 pm

I have a strange situation and I'm hoping there's a workaround somehow. If this is too far off topic, I understand. Thanks in advance for any help anyone can provide.

I find myself loving CN so much that I want to try to save some major disk space and resources by moving my current directory-based flat-file system into CN. Ideally, this will import the existing data. I'm trying to use the txtdir2xml app to build the import file, but it fails with an error:

Code:

Traceback (most recent call last):
  File "c:\python32\lib\site-packages\cx_Freeze\initscripts\Console3.py", line 27, in <module>
  File "txtdir2xml.py", line 100, in <module>
  File "txtdir2xml.py", line 17, in main
  File "txtdir2xml.py", line 44, in convert
  File "txtdir2xml.py", line 56, in convertFiles
  File "txtdir2xml.py", line 69, in addTextFileToXml
  File "C:\Python32\lib\codecs.py", line 300, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte


I was prepared to split it into 256 groups and then import them one by one with:

Code:

FOR /L %k IN (0,1,255) DO echo txtdir2xml -r -e utf-8 %k ips-%k.xml


The error doesn't indicate which file is causing the problem, so with the number of files I'm looking at, it would take a lot of time to figure it out manually.
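
A quick way to narrow it down would be a throwaway Python sketch like this (the root path and the .txt filter are placeholders, not anything txtdir2xml itself does):

Code:

import os

# Walk the tree and report every file that is not valid UTF-8.
root = r"c:\temp"  # placeholder root directory

for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        if not name.lower().endswith(".txt"):
            continue
        path = os.path.join(dirpath, name)
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8")
        except UnicodeDecodeError as e:
            print(path, "->", e)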

Oh, about that... the files I'm importing are IP data files generated by NirSoft's WhosIP utility, using the -r (recursive) command-line option. The output is then redirected to a txt file on disk, like so:

Code:

whosip -r 1.2.3.4 > c:\temp\1\2\3\1.2.3.4.txt


So anyway... I have roughly 70k files to import, consisting of 240 MB of actual data and 8+ GB of physical disk storage (thanks to the block size being significantly larger than the actual files). I know that porting this data into SQLite will save me a lot of space and improve performance when accessing and parsing the information, but I figure I may as well go all the way and put it in CN for the additional tagging and GUI features. I also haven't found any other easy way to port the data directly from files into SQLite, and CN seems as good an option as any to make that happen.

If anyone can think of a SIMPLE way to make this happen, I would really appreciate it. Otherwise I'm going to have to write a script that'll do it.

Thanks again!

Re: importing thousands of text files - unknown encodings

Postby shawnkhall » Wed Nov 27, 2013 11:17 pm

Oh, and I've tried both utf-8 and utf-16. Both return errors in txtdir2xml, though the errors differ slightly: utf-8 reports that 0xff is an invalid start byte, and utf-16 reports that the file doesn't start with a BOM.
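
Dumping the first few bytes of a couple of the failing files shows what's actually in there; a quick sketch (the path is just a placeholder):

Code:

# Print the first four bytes of a file to check for a byte order mark.
# ff fe = UTF-16 LE, fe ff = UTF-16 BE, ef bb bf = UTF-8 with BOM.
# A UTF-16 LE BOM would explain the utf-8 codec choking on 0xff at position 0.
path = r"c:\temp\1\2\3\1.2.3.4.txt"  # placeholder

with open(path, "rb") as f:
    head = f.read(4)
print(" ".join("%02x" % b for b in head))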

Re: importing thousands of text files - unknown encodings

Postby Mark S. » Thu Nov 28, 2013 2:22 am

I always get confused by UTF stuff, but I have some guesses.

If all your data was generated the same way, then it is all using the same encoding.

The encoding being used in your files is ASCII but with additional assignments for byte values 128-255.

So, your input is neither UTF-8 nor UTF-16, but ASCII with byte values above 127.

What's needed is an additional forced ASCII mode in the conversion utility, which would assume the incoming text was always ASCII no matter what the high bit was set to. This might mean that your data looks a little different in the imported edition.

If you have Python on your system, then I THINK that changing this line in the Python code:

file = open(filePath, encoding = encoding)

to

file = open(filePath, encoding = "iso-8859-1")

might be an extremely lazy way of forcing the code to do what you want. Er, I assume you have backups?

Good luck!
Mark

Re: importing thousands of text files - unknown encodings

Postby shawnkhall » Thu Nov 28, 2013 4:59 am

Thanks for the response.

Unfortunately, different NICs use different encodings and have different objectives. KRNIC is always UTF-16; ARIN is *usually* ASCII, but even within ARIN you have different ISPs and hosting providers that return data in completely different formats from their own RWHOIS systems. The encoding is effectively random. I'm thinking it might be easiest to go ahead and convert it to SQLite directly, storing the raw bytes in hex-encoded (X'...') format, and then port it out from there.
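
Roughly what I have in mind, sketched with Python's sqlite3: parameter binding stores the raw bytes as a BLOB, which does the same job as hand-built X'...' literals with no decoding step at all. The schema, the root path, and the <ip>.txt naming assumption are all just illustrative.

Code:

import os
import sqlite3

root = r"c:\temp"  # placeholder root directory

conn = sqlite3.connect("whois.db")
conn.execute("CREATE TABLE IF NOT EXISTS whois (ip TEXT PRIMARY KEY, raw BLOB)")

for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        if not name.lower().endswith(".txt"):
            continue
        with open(os.path.join(dirpath, name), "rb") as f:
            raw = f.read()  # raw bytes - no decoding needed for a BLOB
        ip = name[:-4]  # assumes files are named <ip>.txt
        conn.execute("INSERT OR REPLACE INTO whois (ip, raw) VALUES (?, ?)",
                     (ip, raw))

conn.commit()
conn.close()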

Oh, and backups? Absolutely. One of my goals in dropping this into a real database is to make backups easier - right now it takes just over half an hour to enumerate the folders and collect the files into a tgz. Sigh. I could drop that to less than a minute once it's in SQLite format.

Re: importing thousands of text files - unknown encodings

Postby CintaNotes Developer » Fri Nov 29, 2013 10:39 am

shawnkhall wrote: I find myself loving CN so much that I want to try to save some major disk space and resources by moving my current directory-based flat-file system into CN.


Really glad to hear of such devotion ;)

shawnkhall wrote: Ideally, this will import the existing data. I'm trying to use the txtdir2xml app to build the import file, but it fails with an error:


It seems that some of your files use UTF-8, some UTF-16, and maybe others use some other encoding.
So you need to bring all your txt files to one encoding first. To do that, you can use the advice from this article. After that, txtdir2xml should work OK.
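
For reference, a minimal sketch of such a conversion (my rough outline, not the article's method): sniff the BOM and fall back to latin-1, which can decode any byte sequence, then rewrite everything as UTF-8. It modifies files in place, so make sure you have backups first.

Code:

import codecs
import os

root = r"c:\temp"  # placeholder root directory

def guess_encoding(head):
    # BOM sniffing; files without a BOM are assumed to be latin-1,
    # which never fails to decode.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return "latin-1"

for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        if not name.lower().endswith(".txt"):
            continue
        path = os.path.join(dirpath, name)
        with open(path, "rb") as f:
            raw = f.read()
        text = raw.decode(guess_encoding(raw[:4]))
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)  # rewrite the file in place as UTF-8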

shawnkhall wrote: So anyway... I have roughly 70k files to import, consisting of 240 MB of actual data and 8+ GB of physical disk storage (thanks to the block size being significantly larger than the actual files). I know that porting this data into SQLite will save me a lot of space and improve performance when accessing and parsing the information, but I figure I may as well go all the way and put it in CN for the additional tagging and GUI features. I also haven't found any other easy way to port the data directly from files into SQLite, and CN seems as good an option as any to make that happen.

Oh, a CN database 280 MB in size... never tried it, but I guess it will be VERY slow, at least for search.
IMO it would be better to split your files into several notebooks (and then maybe even further into different sections) based on some criteria.

Please keep us posted on how it finally works out for you!
Alex

Re: importing thousands of text files - unknown encodings

Postby shawnkhall » Fri Nov 29, 2013 11:51 am

CintaNotes Developer wrote: It seems that some of your files use UTF-8, some UTF-16, and maybe others use some other encoding. So you need to bring all your txt files to one encoding first. To do that, you can use the advice from this article. After that, txtdir2xml should work OK.


Fantastic! This appears to be working. It looks like it's going to take several hours to convert the files (it's been running for about 30 minutes and has converted 8,000 files), so I may as well go to bed now.

CintaNotes Developer wrote: Oh, a CN database 280 MB in size... never tried it, but I guess it will be VERY slow, at least for search.

It can't possibly be as slow as it is now. I use Agent Ransack to make it go faster, and it's still usually a 20-minute delay to get anything more than the filename out. Most access is a simple blob pull and a freeform regex against that for an explicit IP record. This ought to be quite a bit faster than it is now.

Really, though, search isn't my goal here. The data is used to accurately determine netblock size and scope for filtering botnets and infected or malicious organizations, and it will ultimately be processed by an existing tool I wrote for that purpose a few years ago. I'd also like to be able to assign class size dynamically in the raw metadata instead of having to parse it each time at runtime.

I really just need a well-formed structure for data storage so that processing the data doesn't waste resources unnecessarily and absolutely NO unnecessary requests are sent to the respective NICs, since some of them are, well, a PITA when it comes to asking them for data. LACNIC, for example, will refuse connections if you ask for more than one record within 6 seconds, and will block you if you request more than 3000 records within a 24-hour window.

Since the very nature of these botnets is that they tend to produce frequent infections from repeat IPs, if I didn't have a local cache of the NIC data to refer to, I would be blocked by most NICs all the time. On any given day I process between 2k and 15k records, and the vast majority are repeats. I currently cache results for 60 days; if I didn't, there's just no way I'd be able to collect and process the data at all.
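
For the curious, the cache-first pattern is nothing fancy; roughly this, with the table layout and the whosip invocation invented for the sketch:

Code:

import sqlite3
import subprocess
import time

SIXTY_DAYS = 60 * 24 * 60 * 60  # cache lifetime in seconds

conn = sqlite3.connect("whois.db")
conn.execute("""CREATE TABLE IF NOT EXISTS whois
                (ip TEXT PRIMARY KEY, raw BLOB, fetched INTEGER)""")

def lookup(ip):
    # Serve from the cache while the record is under 60 days old;
    # only fall through to a live NIC query on a miss or a stale entry.
    row = conn.execute("SELECT raw, fetched FROM whois WHERE ip = ?",
                       (ip,)).fetchone()
    if row and time.time() - row[1] < SIXTY_DAYS:
        return row[0]
    raw = subprocess.check_output(["whosip", ip])  # one live request
    conn.execute("INSERT OR REPLACE INTO whois VALUES (?, ?, ?)",
                 (ip, raw, int(time.time())))
    conn.commit()
    return raw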

I'm going to try using CN initially, but I'm afraid that the data scope might be insufficient. In any case, it'll definitely get me closer to the final goal - and all effort eventually pays off.

CintaNotes Developer wrote: Please keep us posted on how it finally works out for you!


Thanks, I will. :)

Re: importing thousands of text files - unknown encodings

Postby CintaNotes Developer » Fri Nov 29, 2013 12:51 pm

shawnkhall wrote: Fantastic! This appears to be working. It looks like it's going to take several hours to convert the files (it's been running for about 30 minutes and has converted 8,000 files), so I may as well go to bed now.

Nice)

shawnkhall wrote: I'm going to try using CN initially, but I'm afraid that the data scope might be insufficient. In any case, it'll definitely get me closer to the final goal - and all effort eventually pays off.

Thanks for the detailed explanation. Well, if you think trying CN here is worth the effort, no problem. At the same time, it will be a really good stress test for CN; I'm curious how it fares :D
Alex

Re: importing thousands of text files - unknown encodings

Postby CintaNotes Developer » Thu Dec 26, 2013 9:03 am

Hi shawnkhall,

I'm curious, did it eventually work out for you? )
Alex

Re: importing thousands of text files - unknown encodings

Postby shawnkhall » Sun Mar 30, 2014 3:58 am

I'm afraid not. Every time I tried to perform the import it would stall and eventually lock up. I suspect there's just too much data. :(

Re: importing thousands of text files - unknown encodings

Postby CintaNotes Developer » Wed Apr 09, 2014 11:27 am

Sorry to hear that. BTW, in the latest version we've fixed the import function so that it at least reports what's wrong.
So if you still have that XML and could re-run the import with CN 2.5.2, I'd be most obliged!
Thanks!
Alex
