walthervonstolzing

joined 2 years ago
[–] [email protected] 1 points 1 week ago
[–] [email protected] 5 points 4 months ago

Santagate 2019 Pro for Workgroups

[–] [email protected] 27 points 7 months ago

That's a great change from the old days, when you needed a university affiliation to access a paper, and sometimes even that wasn't enough.

'Oxford Scholarship Online' would license different sets of books to different departments; so someone from the philosophy department couldn't get access to books classified under sociology or history.

Imagine doing something similar at the checkout table in a 'physical' library.

[–] [email protected] 8 points 7 months ago

Here's another video: https://www.youtube.com/watch?v=PriwCi6SzLo (including an interview with the great Alexandra Elbakyan).

Cory Doctorow recently wrote about this in some detail (incl. helpful links): https://pluralistic.net/2024/08/16/the-public-sphere/#not-the-elsevier

[–] [email protected] 2 points 7 months ago

The name of the PDF file inside the torrent is its MD5 hash, without the .pdf extension.

On libgen.rs you can see the MD5 hash on the download page; on libgen.li you need to look at the JSON file linked from the search result, as they don't render it in the UI.
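Given that naming scheme, a few lines of stdlib Python can verify a downloaded file against its name. This is just a sketch; the helper names are mine, not from any libgen tooling:

```python
import hashlib
from pathlib import Path

def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to spare memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_libgen_name(path: str) -> bool:
    """True if the filename (minus the .pdf extension) equals the file's MD5 hash."""
    return Path(path).stem.lower() == md5_of_file(path)
```

Handy for spotting corrupted downloads before you bother opening them.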

[–] [email protected] 10 points 7 months ago (2 children)

The torrents are alive; as long as you can get the torrent links from libgen, you have access to the files. (No need to share whole archives either, you can pick & choose).

[–] [email protected] 5 points 7 months ago* (last edited 7 months ago) (1 children)

Wouldn't enabling the --system-site-packages flag during venv creation do exactly what the OP wants, provided that gunicorn is installed as a system package (e.g. with the distro's package manager)? https://docs.python.org/3/library/venv.html

Sharing packages between venvs would be a dirty trick indeed; though sharing with system-site-packages should be fine, AFAIK.
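The same flag is exposed programmatically through the stdlib venv module, which makes it easy to try out; the `app-venv` directory name here is just an example:

```python
import venv

# Create a virtual environment that can also import system-wide packages
# (e.g. a gunicorn installed via the distro's package manager).
# Equivalent to: python -m venv --system-site-packages app-venv
venv.create("app-venv", system_site_packages=True, with_pip=False)
```

Packages you later pip-install inside the venv still shadow the system-wide ones, so local upgrades stay local.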

[–] [email protected] 1 points 11 months ago

Michael W. Lucas's "Networking for System Administrators" is a great resource: https://mwl.io/nonfiction/networking#n4sa

[–] [email protected] 1 points 1 year ago

That's not a consideration in favor of grouping h/j as the 'back keys', and k/l as the 'forward' keys, though. It's perfectly comfortable & intuitive to have the index finger on the key that goes forward.

[–] [email protected] 1 points 1 year ago* (last edited 1 year ago) (2 children)

Why, though? Why is it so obvious that j 'should have' been [edit: up]?

[–] [email protected] 3 points 1 year ago

Sure, if you drag it through the garden.

[–] [email protected] 9 points 1 year ago* (last edited 1 year ago)

PyMuPDF is excellent for extracting 'structured' text from a PDF page — though I believe 'pulling out relevant information' will still be a manual task, UNLESS the text you're working with allows parsing into meaningful units.

That's because 'textual' content in a PDF is nothing more than a bunch of instructions to draw glyphs inside a rect that represents a page. Utilities that come with MuPDF or Poppler arrange those glyphs (not always perfectly) into 'blocks', 'lines', and 'words' based solely on whitespace separation. The programmer who uses those utilities in an end-user-facing application then has to figure out how to create the illusion (so to speak) that the user is selecting/copying/searching for paragraphs, sentences, and so on, in proper reading order.

PyMuPDF comes with a rich collection of convenience functions to make all that less painful, such as dehyphenation and removal of superfluous whitespace; but you'll still need some further processing to pick out the humanly relevant info.

Python's built-in regex capabilities can suffice for that parsing; if they don't, you might want to look into NLTK's tools, which apply sophisticated methods to tokenize words & sentences.
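For a sense of where the regex approach starts to strain (and where NLTK's tokenizers earn their keep), here's a crude stdlib-only sentence splitter; the sample text is made up:

```python
import re

text = ("PDF text arrives as one stream. Sentence boundaries are guessed "
        "from punctuation, e.g. periods. That guess often fails on abbreviations!")

# Naive rule: split where end punctuation is followed by whitespace and a
# capital letter. "e.g." survives here only because "periods" is lowercase;
# NLTK's punkt tokenizer handles abbreviations much more robustly.
sentences = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
print(sentences)
```

Fine for clean prose; for anything messier, a trained tokenizer is worth the dependency.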

EDIT: I really should've mentioned some proper full-text search tools. Once you have a good plaintext representation of a PDF page, you can feed it into tools like the following to index it for relevant info:

https://lunr.readthedocs.io/en/latest/ -- this is easy to set up & use, especially in a Python project.

... it's based on principles that are put to use in this full-scale, 'industrial strength' full-text search engine: https://solr.apache.org/ -- it's a bit of a pain to set up, but Python can interface with it through any HTTP client. Once you set up some kind of mapping between search tokens/keywords/tags, the plaintext page, & the actual PDF, you can get from a phrase search, for example, to a bunch of vector graphics (i.e. the PDF) relatively painlessly.
