What's going on, everybody, and welcome to part two of our chatbot with Python and TensorFlow tutorial series. In this tutorial, what we're going to be doing is beginning to build our database, which is going to store our parent comments paired with their best reply comments. The reason we want to do something like this is, first of all, a lot of these files are way too big for us to just read into RAM and then create the training files from, even for individual months. And chances are, if you want to create a really nice chatbot, you're eventually going to want to work on many months of data, so possibly billions of comments, which you do have at your disposal. When that's the case, we probably want to have some sort of database.

Now, for the purposes here, just to keep things simple, we won't deal with MySQL servers or any other big database server type of thing. I'm just going to use SQLite. It'll help us get the job done, but feel free to use pretty much whatever you want.

Before we get too deep, I just want to address what our data should look like. If you downloaded the Reddit data and extracted it, it should look something like this: you should have years, basically 2007 to 2015. Again, BigQuery does have data all the way up to recently, like last month or whatever that would be. If you do go that route, just know that your format is going to be totally different from ours, so you'll have to adapt what you're doing. But if someone does share some way to officially pull from BigQuery, I'll probably append that to the end of this tutorial series. There are also multiple models that I'm working on with chatbots, so I'm pretty confident there will be some follow-up videos.

Anyway, enough of that. If you click on any of these years, normally you'll just have all these compressed files, but if you extract them, it looks like this. These files contain a bunch of samples, and each sample looks like this. So this is just one sample, and as you can see, there's a bunch of data here.
It's obviously JSON, so it's keys and values, and first of all there's a lot of wasted information here. Just putting it into a database will severely decrease the size of this data: you have each column name once and then all the data, so this much data becomes just this. That makes more sense. Also, we don't need all of these fields. We don't need link ID, for example, and we don't really need name. We might be interested in created. We're probably not interested in when it was retrieved. Author flair we probably don't care about; you might, but we probably don't, and so on. Obviously we do care about things like score, ups and downs, and maybe whether they were gilded or not. We might care about those, especially if we're trying to make some sort of very specific bot. Same thing with the subreddit, if you wanted to create some really specific type of chatbot. For now I want it to be fairly general, but I care about at least score. One thing to note, though: I'm fairly confident score is miscalculated. Downs are always zero, so score is always off, if I recall right. I can't remember if that's truly the case, but anyway, it's really quick to test.
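As a rough illustration of the point about keeping only a few fields and checking the score, here is a minimal sketch. The sample row is made up for this example, and the key names (body, created_utc, ups, downs and so on) are assumptions about the dump format rather than something confirmed in the video.

```python
import json

# Hypothetical, trimmed-down sample row. The values are invented and the
# key names are assumed; real rows carry many more keys than this.
sample_line = (
    '{"author": "example_user", "body": "Example reply text", '
    '"created_utc": "1430438400", "downs": 0, "gilded": 0, '
    '"link_id": "t3_abc123", "parent_id": "t3_abc123", '
    '"score": 4, "subreddit": "examples", "ups": 4}'
)

row = json.loads(sample_line)

# Keep only the fields we expect to store; everything else is dropped.
KEEP = ("parent_id", "body", "subreddit", "created_utc", "score")
slim = {key: row[key] for key in KEEP}

# Quick sanity check on the score question: if downs were recorded
# properly, ups minus score should match the downs field.
implied_downs = row["ups"] - row["score"]

print(slim)
print("downs field:", row["downs"], "| implied downs:", implied_downs)
```

Running the same comparison over a few thousand real rows would show whether downs is ever non-zero and whether score ever differs from ups.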
I forget the details; all I know is to take it with a grain of salt, because it's not right. I'm pretty sure downs are always zero, but I can't remember whether ups minus the actual score would give you the downs, or whether score also always equals ups. Anyway, just know there's some sort of flaw there. Let's continue.

Working in Python now, we're going to start building out the code that we'll be using. Let's import sqlite3 for our database, import json, and then from datetime import datetime. So sqlite3 is obviously for the database, json is for reading that format, and datetime we're really just going to use for outputting some logging information so we know where we are. As you might imagine, going through these huge files can take a lot of time, so sometimes I just like to put in simple outputs that tell us where we are at the time.

Moving along, we're going to say timeframe, and we're going to use 2015-05. Remember the format of the files when you download them; we're basically going to be grabbing this one. They all start with RC. I don't know what RC stands for; it's probably not release candidate, maybe Reddit comments. Anyway, they all have that same format.
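The setup described so far could look roughly like the sketch below. The year-folder path and the log_progress helper are assumptions added for illustration; only the three imports, the timeframe value and the RC file prefix come from the walkthrough.

```python
import sqlite3                  # database, used later in the script
import json                     # parsing each comment row, used later
from datetime import datetime   # timestamps for progress output

timeframe = "2015-05"

# Assumed layout: one folder per year, holding files named RC_<year>-<month>.
year = timeframe.split("-")[0]
dump_path = "{}/RC_{}".format(year, timeframe)

def log_progress(rows_read):
    """Hypothetical helper: print a timestamped progress line and return it."""
    message = "{} | rows read: {}".format(datetime.now(), rows_read)
    print(message)
    return message

print(dump_path)
log_progress(100000)
```

Printing something like this every hundred thousand rows or so is usually enough to tell whether a long run is still moving.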
This is May of 2015, so this is the one that we want. Alternatively, you could take a list of time frames and iterate through them, building the database the same way I'm about to build it.

Once we've done that, I'm also going to have sql_transaction. We're going to have this because, especially when you know you're going to be working with millions of rows, you don't want to insert rows one by one if you don't have to; that's really inefficient. Instead, you want to build up a big transaction and then do it all at once, and it will be gobs faster. So that's what we're going to use that for.

Now, what we're going to do is build out the connection, which is going to be sqlite3.connect. We're going to connect to something dot db (not dot database, although that would still work), and format in timeframe, so this will just be a database called whatever the month and year is. Alternatively, if you wanted, you could do it the other way around. For example, what we're going to call our table is parent_reply or something like that. Instead, you could make the database parent_reply, and then each table name could be the month. To me, the month and year aren't all that valuable; there's no real reason why you would separate those out, so I'm not going to do that, but you could if you wanted. Anyway, then we're going to define our cursor, which is just connection.cursor().
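To make the batching idea concrete, here is a minimal sketch of queuing statements and committing them together. The demo table and the helper names queue_statement and flush are hypothetical, and the sketch connects to an in-memory database so it leaves no file behind; the video connects to a file named after the time frame instead.

```python
import sqlite3

timeframe = "2015-05"
db_name = "{}.db".format(timeframe)  # file name as described in the walkthrough

# ":memory:" keeps this sketch self-contained; pass db_name for a real run.
connection = sqlite3.connect(":memory:")
c = connection.cursor()

# Hypothetical demo table, only here so the inserts have somewhere to go.
c.execute("CREATE TABLE IF NOT EXISTS demo(id INTEGER, body TEXT)")

sql_transaction = []  # statements waiting to be written

def flush():
    """Run every queued statement, then commit once for the whole batch."""
    for sql, params in sql_transaction:
        c.execute(sql, params)
    connection.commit()
    sql_transaction.clear()

def queue_statement(sql, params, batch_size=1000):
    """Queue one statement and flush when the batch is full."""
    sql_transaction.append((sql, params))
    if len(sql_transaction) >= batch_size:
        flush()

for i in range(2500):
    queue_statement("INSERT INTO demo VALUES (?, ?)", (i, "comment {}".format(i)))
flush()  # write whatever is left in the final partial batch

row_count = c.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(row_count)
```

The saving comes from committing once per batch rather than once per row; the batch size of 1000 is an arbitrary choice for this sketch.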
Go ahead and define create_table. This is just going to be your typical c.execute: CREATE TABLE IF NOT EXISTS, and the table is going to be parent_reply. Then we're going to have all of our columns. First of all, we're going to have parent_id, which is going to be TEXT type, and it'll also be our primary key. Yeah, this is going to run way off the screen. I think you can get away with a triple quote here, so we'll find out, just so I don't have to run everything off the screen so much. We'll see how it goes.

So that's parent_id. Now we need the comment ID: comment_id, which again is going to be TEXT. It's not a primary key, but it should be UNIQUE. Then we're going to have parent, which will also be TEXT type, and then the comment itself, so the reply: comment will be TEXT type. I'd also like to log the subreddit, simply because I can see that being a useful thing to track in the future. Different subreddits have different ways of talking, and if you want a smarter-sounding chatbot, you could go with more scientific and engineering types of subreddits. If you wanted a more... never mind, I'm not going to get myself in trouble. We'll stop at that. Anyway, you could get different types of chatbots. Then the Unix time, which is just going to be an integer, and finally we'll take the score, which should also be an INT.

Okay, with that we've got our query, and of course I did just run it off the screen anyway. So that will create the table if it doesn't exist. Then what we do at the end here is start our main block.
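The table definition walked through above can be sketched as follows. It uses an in-memory database so it can be run without creating a file. The TEXT type for subreddit and the column spelling unixtime are assumptions, since neither is spelled out in the walkthrough.

```python
import sqlite3

# In-memory database for this sketch; the video connects to a .db file.
connection = sqlite3.connect(":memory:")
c = connection.cursor()

def create_table():
    # Triple quotes let the statement span several lines instead of
    # running off the screen.
    c.execute("""
        CREATE TABLE IF NOT EXISTS parent_reply(
            parent_id  TEXT PRIMARY KEY,
            comment_id TEXT UNIQUE,
            parent     TEXT,
            comment    TEXT,
            subreddit  TEXT,  -- type assumed
            unixtime   INT,   -- column spelling assumed
            score      INT
        )
    """)

create_table()
create_table()  # safe to repeat: IF NOT EXISTS makes the second call a no-op

# Column names as SQLite reports them (index 1 of each table_info row).
columns = [info[1] for info in c.execute("PRAGMA table_info(parent_reply)")]
print(columns)
```

Calling the function twice shows why it is cheap to run on every start: the second call finds the table already there and does nothing.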
So that's if __name__ == '__main__', and in there we call create_table(), which will create the table if it doesn't exist. The other thing to note is that if the database doesn't exist when you attempt to connect to it, SQLite creates it; that's why we didn't have to create any database ourselves. And since this only creates the table if it doesn't exist, it's relatively cheap to run every time. So we'll go ahead and run that.

That's all for now. In the next tutorial, we'll actually start working through the data. I'm not sure we'll be able to insert much of it, because there's a lot of cleaning up to do, but we'll at least start buffering through the data, cleaning it up, and getting it ready to insert into the database. If you have any questions, comments, concerns, whatever, feel free to leave them below. Otherwise, I will see you in the next tutorial.